<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/courses/ds4b-m1-1-intro/notebooks/s1-manipilation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np

# Introduction

In this session, you will learn the basic grammar of data manipulation, along with some best-practice advice. Since data manipulation always follows a purpose and requires some understanding of the data at hand, we will also take a first look at data exploration and visualization. However, we will only cover the very basics here and skip most of the details. You will have dedicated sessions on these topics later on.

In this session, you will learn:

* How to do basic variable filtering, selection, and manipulation
* How to create various types of data summarization
* How to also apply these actions on grouped data
* How to join data from different sources
* How to reshape (pivot) your data

In my experience, this covers about 80% of common data manipulation tasks. Sounds like fun? Let's get started!




## 2.2. Object classes


In [None]:
# Square brackets [] create a list

v1 = [1,5,11,33]
v1

In [None]:
v2 = ["hello","world"]
v2

In [None]:
v3 = [True, True, False, True]
v3

In R, combining different types of elements in one vector coerces the elements to the least restrictive type.

In Python, a list can hold elements of different types. Here you obtain a list containing three lists and a string, with all elements in their original type.

In [None]:
v4 =[v1, v2, v3, 'boo']
v4

Integers are still numbers, not strings (text). This is easy to see because they are displayed without quotation marks.

Element-wise operations are not possible with lists in the same way as in R. Adding two lists just concatenates them.
Yet, you can achieve element-wise behaviour by using NumPy arrays (np.array) rather than lists.

In [None]:
v1 + v3

In [None]:
np.array(v1) + np.array(v3)

NumPy is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Here you can already see that R comes from maths and statistics, while Python comes from computer science.

In [None]:
# Same for multiplication: multiplying a list by 2 repeats the list instead of doubling each element
v1 * 2

To do element-wise maths, we need to convert the list into an array.

In [None]:
v1_array  = np.array(v1)

v1_array * 2

In [None]:
# What works in R doesn't necessarily work in Python: R would recycle the shorter vector,
# but NumPy cannot broadcast shapes (4,) and (2,) together, so this raises a ValueError
v1_array + np.array([1,7])
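
If you really want R-style recycling, one possible way (a sketch, not something the original notebook does) is to stretch the shorter array yourself with `np.resize`, which repeats its input until the requested length is reached. The values are the same as in `v1` above.

```python
import numpy as np

v1_array = np.array([1, 5, 11, 33])
short = np.array([1, 7])

# np.resize repeats the shorter array until it has the requested shape,
# which mimics R's recycling in this simple case
recycled = np.resize(short, v1_array.shape)

result = v1_array + recycled
print(recycled)
print(result)
```

This only works nicely when you know that recycling is what you want; NumPy raises an error by default precisely so that mismatched lengths do not go unnoticed.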

In [None]:
# that works the same way
sum(v1)

For more maths, you need NumPy or other modules (Python is not primarily a maths language).

In [None]:
np.mean(v1)

In [None]:
# Population standard deviation: ddof (delta degrees of freedom) is 0 by default
np.std(v1, ddof=0)

In [None]:
# Sample standard deviation (ddof=1): this gives the same result as sd() in R

np.std(v1, ddof=1)
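
To see what `ddof` actually changes, here is a small sketch that computes both versions by hand and compares them with NumPy. It uses the same `v1` as above; the variable names are just for illustration.

```python
import numpy as np

v1 = [1, 5, 11, 33]
n = len(v1)
mean = sum(v1) / n

# sum of squared deviations from the mean
squared_dev = sum((x - mean) ** 2 for x in v1)

pop_sd = (squared_dev / n) ** 0.5           # ddof=0: divide by n
sample_sd = (squared_dev / (n - 1)) ** 0.5  # ddof=1: divide by n - 1

print(pop_sd, np.std(v1, ddof=0))
print(sample_sd, np.std(v1, ddof=1))
```

The only difference is the divisor, so the sample version is always a bit larger than the population version.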

In [None]:
np.corrcoef(v1_array,v1_array*5)

In [None]:
v1_array > 2

Most of the content before 2.2.5 can be found in this cheat sheet:

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

## 2.2.5 Data Frames

In Python, data frames are handled by pandas, a very comprehensive library for data manipulation and analysis:

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

In [None]:
# We construct the DataFrame from a dictionary of the form {'column_name': [values]}

dfr1 = pd.DataFrame(
    {'ID':range(1,5), # Python counts from 0 and the last value in a range is excluded
     'FirstName':["Jesper","Jonas","Pernille","Helle"],
     'Female':[False,False,True,True],
     'Age':[22,33,44,55]
})

In [None]:
# Python doesn't really use factors much, and as you can see pandas inferred the column types from your input
dfr1.info()

In [None]:
dfr1.FirstName #dot notation

In [None]:
dfr1['FirstName'] #more traditional subsetting

In [None]:
dfr1.loc[:,'FirstName'] #more complex subsetting

In [None]:
dfr1.iloc[:,1] #index based

In [None]:
# Rows 1 and 2, columns 3 and 4 - the gender and age of Jesper & Jonas
dfr1.iloc[[0,1],[2,3]]


In [None]:
#Same thing
dfr1.loc[[0,1],['Female','Age']]

In [None]:
# Rows 1 and 3, all columns

dfr1.iloc[[0,2],:] # don't forget to count index-1 when going from R to python

In [None]:
# Find everyone over the age of 30 in the data
dfr1[dfr1.Age > 30]

In [None]:
# or "Query style" (There are always many ways of doing the same thing)

dfr1.query('Age > 30')
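
Filters can also combine several conditions. The sketch below rebuilds the same `dfr1` so that it runs on its own, and shows both styles side by side.

```python
import pandas as pd

dfr1 = pd.DataFrame({
    'ID': range(1, 5),
    'FirstName': ["Jesper", "Jonas", "Pernille", "Helle"],
    'Female': [False, False, True, True],
    'Age': [22, 33, 44, 55],
})

# Boolean-mask style: each condition needs its own parentheses,
# & means "and", | means "or"
older_women = dfr1[(dfr1['Age'] > 30) & (dfr1['Female'])]

# Query style: plain "and" / "or" work inside the string
same_with_query = dfr1.query('Age > 30 and Female')

print(older_women['FirstName'].tolist())
```

Forgetting the parentheses in the mask style is a common source of confusing errors, because `&` binds more tightly than `>`.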

## 2.2.6 Flow Control (loops & friends)

Python is made for readability, and therefore indentation and line breaks are syntactically meaningful.


In [None]:
x = 5 
y = 10

if (x==0):
  y = 0 
else:
  y = y/x  
  print(y)

In [None]:
for i in range(1,x+1):
  print("OMG, i just counted to " + str(i))

In [None]:
while x > 0:
  print(x) 
  x = x-1

In [None]:
while True: 
  print(x)
  x = x + 1
  if x > 7:
    break

Python does not have pipes. Yet, much of what piping does in R can be achieved by chaining methods with the dot "." in Python.

In [None]:
starwars = pd.read_csv("https://sds-aau.github.io/SDS-master/M1/data/characters_starwars.csv")

Python does not offer the same consistent verb and pipe grammar. But it's OK :-)

More on that here: https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07
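
To give a feeling for how chaining with the dot can stand in for a pipe, here is a minimal sketch. The data frame `people` is made-up toy data (not the Star Wars file), and wrapping the chain in parentheses is just one common way to spread it over several lines.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
people = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height': [180, 160, 200, 170],        # cm
    'mass': [80.0, 50.0, 120.0, 60.0],     # kg
})

result = (
    people
    .query('height > 165')                                       # like filter
    .assign(bmi=lambda d: d['mass'] / (d['height'] / 100) ** 2)  # like mutate
    .sort_values('bmi', ascending=False)                         # like arrange
    .loc[:, ['name', 'bmi']]                                     # like select
)
print(result)
```

Each method returns a new data frame, so the next method can be called on it directly, much like passing a result through a pipe.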

In [None]:
# filter

starwars[starwars['species'] == 'Droid']

In [None]:
# select

starwars[['name','homeworld']].head(10)

In [None]:
starwars.drop(['birth_year','skin_color'], axis=1).head(10)

In [None]:
# not as pretty as in R but hey...we get there...and who wants to select that way?

fancy_columns = [x for x in starwars.columns if x.endswith('color')]

starwars[['name'] + fancy_columns].head(10)

In [None]:
# arrange in R corresponds to sort_values in pandas: homeworld ascending, mass descending
starwars.sort_values(by=['homeworld', 'mass'], ascending=[True, False]).head(10)

`mutate` in R is a bit weird from a Python point of view. Let's try it.
There are many ways to accomplish the same thing in Python.

In the example, Daniel calculates BMI and mass.rel for all characters.
In Python, you can use `map` to work on a single column and `apply` to work on a whole dataframe.

We can combine these two with so-called lambda functions (anonymous functions). They have a strange syntax but are nice.


In [None]:
# Complicated but good for more complex stuff

starwars['bmi'] = starwars.apply(lambda x: x['mass']/(x['height'] / 100)**2, axis=1) #x is here one row of the DF 

In [None]:
# easy!

starwars['bmi'] = starwars['mass'] / (starwars['height'] /100)**2
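
The cells above use `apply` and plain column arithmetic, so here is a small sketch of `map` as well. The data frame `chars` and its values are made up for illustration and are not taken from the Star Wars file.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
chars = pd.DataFrame({
    'name': ['R2', 'Luke', 'Leia'],
    'species': ['Droid', 'Human', 'Human'],
    'height': [96, 172, 150],
})

# map works on one column (a Series), value by value
chars['height_m'] = chars['height'].map(lambda h: h / 100)

# map also accepts a dictionary, which acts as a lookup table
chars['is_droid'] = chars['species'].map({'Droid': True, 'Human': False})

# apply with axis=1 hands the function one whole row at a time
chars['label'] = chars.apply(lambda row: f"{row['name']} ({row['species']})", axis=1)

print(chars)
```

For simple arithmetic the plain column expression is usually the easiest and fastest option; `map` and `apply` become handy once the logic no longer fits into one expression.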

In [None]:
starwars['mass_rel'] = starwars['mass'] / starwars['mass'].max()

In [None]:
rng = starwars.loc[:,'name':'mass'].columns.to_list() # all column names from 'name' through 'mass' (label slices include both ends)

starwars.loc[:,rng+['bmi','mass_rel']].sort_values('bmi', ascending=False).head(10)

In [None]:
#summarize

print(starwars['height'].min())
print(starwars['height'].mean())
print(starwars['height'].max())
print(starwars['height'].std())

In [None]:
# group_by

starwars.groupby(by='homeworld')['height'].mean().sort_values(ascending=False).head(10)
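
If you want several summaries per group at once (closer to `summarise` with multiple outputs in R), `agg` is one option. A minimal sketch on made-up toy data; the heights here are invented and not taken from the Star Wars file.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
chars = pd.DataFrame({
    'homeworld': ['Tatooine', 'Tatooine', 'Naboo', 'Naboo', 'Naboo'],
    'height': [172, 167, 96, 165, 224],
})

summary = (
    chars
    .groupby('homeworld')['height']
    .agg(['count', 'mean', 'max'])          # one output column per summary function
    .sort_values('mean', ascending=False)
)
print(summary)
```

The result is a data frame with one row per group, so it can be sorted or filtered like any other.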

## 4 Case Study: Cleaning up historical data on voting of the United Nations General Assembly

In [None]:
# R has RDS, in Python we have parquet and a bunch of other stuff
votes = pd.read_parquet("https://sds-aau.github.io/SDS-master/M1/data/votes.pq")


In [None]:
votes.head()

In [None]:
votes.vote.unique()

In [None]:
# 4.3.1
votes[votes['vote'] <= 3].head()

In [None]:
# 4.3.2
votes['session'] + 1945

In [None]:
!pip install countrycode

In [None]:
from countrycode import countrycode
# let's measure how long it takes
%time countries = countrycode.countrycode(votes.ccode[:100], origin='cown', target='country_name')

The package is a bit slow, so it is probably faster to translate only the unique country codes and then merge them back (this is a bit of a deviation from the R notebook).

In [None]:
unique_countrycodes = votes.ccode.unique()
uniques_countries = countrycode.countrycode(unique_countrycodes, origin='cown', target='country_name')

In [None]:
lookup_df = pd.DataFrame({'ccode' : unique_countrycodes,
                          'country_name': uniques_countries})

lookup_df.head()

In [None]:
#adding the countries to the initial data
# this will be covered later

votes.merge(lookup_df, how='left').head(10)
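
Without an `on` argument, `merge` joins on all columns the two data frames have in common. When you want to be explicit and also check what happened, something like the following sketch can help. The two small data frames are made up (including the code 999, which is deliberately missing from the lookup), while `on`, `indicator` and `validate` are regular `merge` parameters.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
votes_toy = pd.DataFrame({'rcid': [1, 1, 2], 'ccode': [2, 20, 999], 'vote': [1, 3, 1]})
lookup_toy = pd.DataFrame({'ccode': [2, 20], 'country_name': ['UNITED STATES', 'CANADA']})

merged = votes_toy.merge(
    lookup_toy,
    on='ccode',              # name the key column explicitly
    how='left',              # keep every row of votes_toy
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate='many_to_one',  # raises an error if ccode is not unique in the lookup
)

# rows that found no partner in the lookup table
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
```

Checking the unmatched rows right after a join is a cheap way to notice codes that could not be translated.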

In [None]:
# bringing all together

votes = votes[votes.vote <= 3].copy() # copy, so that adding a column below does not trigger a SettingWithCopyWarning
votes['year'] = votes['session'] + 1945
votes = votes.merge(lookup_df, how='left') # left merge on ccode
votes = votes.sort_values(['year','rcid','ccode'])


In [None]:
votes.reset_index(drop=True,inplace=True)

In [None]:
votes.head(10)

## 4.4 Generating first insights

In [None]:
# not entirely sure why some votes are missing compared to R
# find it out, people
len(votes[votes.vote == 1]) / len(votes)

In [None]:
# Using nice built in stuff

votes.vote.value_counts(normalize=True)

In [None]:
votes.groupby('year')['vote'].value_counts(normalize=True)
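
The grouped `value_counts` returns a long Series with a two-level index (year and vote). To read it as a table, you can move the vote level into columns with `unstack`. A minimal sketch on made-up toy data (the counts are invented), treating 1 as "yes" as this notebook does:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'year': [1946, 1946, 1946, 1946, 1947, 1947],
    'vote': [1, 1, 2, 3, 1, 3],
})

shares = toy.groupby('year')['vote'].value_counts(normalize=True)

# one row per year, one column per vote code; combinations that never occur become 0
wide = shares.unstack(fill_value=0)
print(wide)
```

In the wide table, the share of yes votes per year is simply the column `wide[1]`.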

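The rest of this section has been covered in many assignments and the Python EDA lecture.

The R notebook loads the vote descriptions with `descriptions <- readRDS("data/UN_votes_descriptions.rds")`. In Python, we can read the same RDS file with the pyreadr package.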

In [None]:
!wget https://github.com/SDS-AAU/M1-2019/raw/master/data/UN_votes_descriptions.rds

In [None]:
!pip install pyreadr

In [None]:
import pyreadr

In [None]:
result = pyreadr.read_r('UN_votes_descriptions.rds')

In [None]:
descriptions = result[None] # an RDS file holds a single object, which pyreadr stores under the key None

In [None]:
descriptions['year'] = pd.to_datetime(descriptions.date).dt.year

In [None]:
votes_joined = votes.merge(descriptions, how='inner')
votes_joined.drop(['ccode','date','session','unres'], axis=1, inplace=True)
votes_joined.head()

In [None]:
# yearly vote shares for Denmark; vote == 1 is the share of yes votes
dk_vote_shares = votes_joined[(votes_joined.country_name == 'DENMARK')].groupby('year')['vote'].value_counts(normalize=True)
dk_vote_shares.loc[:,1].plot()

In [None]:
countries = ["United States", "China", "France", "Denmark"]

countries = [country.upper() for country in countries] # Upper casing
print(countries) # only to check the result


votes_joined[votes_joined['country_name'].isin(countries)].head()

In [None]:
# Yes, python can do pretty plots, too

import seaborn as sns
sns.set(style="darkgrid")

sns.set(rc={'figure.figsize':(11.7,8.27)})

In [None]:
countries_perc_yes = votes_joined[votes_joined['country_name'].isin(countries) ].groupby(['year', 'country_name'])['vote'].value_counts(normalize=True)


In [None]:
to_plot = countries_perc_yes.loc[:,:,1]

sns.lineplot(x = to_plot.index.get_level_values(0), y = to_plot.values, hue = to_plot.index.get_level_values(1))

## Tidying our data