# Course: Intro to Python & R for Data Analysis
## Lecture: Let's get this data started - Pandas
Professor: Mary Kaltenberg

Fall 2020

contact: mkaltenberg@pace.edu

About me: www.mkaltenberg.com

## Objectives:

Part 1 (Quickstart Guide):
- dataframes
- how to import data into a dataframe
- merge
- drop columns
- combine numpy AND pandas
- how to export data

Part 2 (Detailed Guide):
- concat
- transform and pivot
- groupby
- hierarchical indexing
- aggregate



<img src ='https://media.giphy.com/media/z6xE1olZ5YP4I/giphy.gif' >

Pandas technically comes from "Panel Data." I prefer the dancing pandas.

In [1]:
# Let's import the package
import pandas as pd

# Data Frames + Importing

We made it. We are finally at the point of importing real data and doing something with it!  Wahoooo!

So, pandas works with dataframes. A dataframe is technically an object. Those familiar with object-oriented programming will know this concept, but well, it is what it seems. It is a thing, an object, that you can manipulate.
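To make the "object" idea concrete, here is a minimal sketch that builds a tiny dataframe by hand. The column names and values are made up for illustration; they are not from the course datasets.

```python
import pandas as pd

# A made-up toy dataframe (not course data): each dict key becomes a column
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gini': [0.31, 0.45, 0.28],
})

# Because it is an object, it carries attributes (information about itself)...
print(toy.shape)      # (number of rows, number of columns)
print(toy.columns)    # the column labels

# ...and methods (things it can do)
print(toy.head(2))    # first two rows
```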

The first step on this journey is to import data. 

I've now uploaded data that we can use for today's exercise:

- world justice project 
- general inequality dataset

First step is to import the data and name it a variable.

The read csv function is:
`pd.read_csv()`

There is also `pd.read_table()` (typically for text files)

or import json into a dataframe directly with `pd.read_json()`

There is also an option to read stata files:

`from pandas import read_stata`

`pd.read_stata()`

Note of honesty about stata.

In [None]:
# type pd.read_ and press Tab (tab completion) to see the functions for reading other types of data
pd.read_

So, one note. You will end up managing a bunch of files, and you will want them organized neatly, not floating around where they are impossible to find.

Generally, I create a root folder that I operate from, and inside it a few folders that I move between throughout the process.

To remember my root, I just create a variable that stores the path. There are a bunch of ways of doing this (the `pathlib` module has a `Path` class that you can use, but I'm stuck in my ways at this point.)

In [None]:
path = '/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/'
ds = '/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/Lecture_6/DS/'

#you can name it whatever you like and use however many folders you want. 
# Just be organized or you will regret it later

In [None]:
#this is another way to organize files - pathlib has a lot of features 
# that can be useful when you want to recursively open a variety of data files and append them

from pathlib import Path, PureWindowsPath
p = Path('/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/')

#For those on Windows, you can also use this to convert filenames
#Mac uses forward slashes and Windows uses backslashes in directories - this difference causes chaos
print(PureWindowsPath(ds+'wjp.csv'))
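Here is a rough sketch of the "open a variety of data files and append them" idea with `pathlib`. To keep it runnable anywhere, it first creates a throwaway folder with two tiny CSV files; the folder, the file names (`part1.csv`, `part2.csv`), and the values are all made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway folder with two small CSV files so the example runs anywhere
root = Path(tempfile.mkdtemp())
(root / 'DS').mkdir()
pd.DataFrame({'country': ['A'], 'year': [2019]}).to_csv(root / 'DS' / 'part1.csv', index=False)
pd.DataFrame({'country': ['B'], 'year': [2020]}).to_csv(root / 'DS' / 'part2.csv', index=False)

# The / operator joins path pieces with the right slash for your operating system
ds_dir = root / 'DS'

# rglob searches the folder and all subfolders; sorted() keeps the order predictable
files = sorted(ds_dir.rglob('*.csv'))

# Read each file and stack them into one dataframe
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(combined)
```

With your own data you would skip the setup lines and point `ds_dir` at your real folder.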

In [None]:
wjp = pd.read_csv('/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/Lecture_6/DS/wjp.csv') 


In [None]:
wjp = pd.read_csv(ds+'wjp.csv',index_col=0) 
ineq = pd.read_csv(ds+'ineq.csv') 

#here you can see I am using the variable path names I created so I can easily access the information 
# instead of writing the entire path name out

In [None]:
wjp = pd.read_csv(ds+'wjp.csv') 

In [None]:
# you can learn some general things about the dataframe
#what columns are in it
wjp.columns

In [None]:
#what the first few rows look like 
wjp.head()

In [None]:
# or specify the rows

wjp.head(20) # first 20
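Beyond `columns` and `head()`, a few other quick checks are handy right after importing. This is a small sketch on a made-up dataframe (the values are not from `wjp`); the same calls should work on any dataframe.

```python
import pandas as pd

# Made-up stand-in for a real dataset
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'year': [2019, 2019, 2020, 2020],
    'gini': [0.31, 0.45, None, 0.28],
})

print(toy.shape)       # (rows, columns)
print(toy.dtypes)      # the type stored in each column
toy.info()             # columns, non-missing counts, and types in one summary
print(toy.describe())  # summary statistics for the numeric columns
print(toy.tail(2))     # last two rows (the mirror image of head)
```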

### Troubleshooting

Some data is trickier, though. Here are some common troubleshooting issues.

Not all CSVs are created equal. By default, pandas expects UTF-8 files. Sometimes, you may have to re-export your file as a 'UTF-8' csv.

<img src ='utf-8csv.png' width=500 >

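When you import, not all files are comma-separated. You may have different separators (also called delimiters).

Separators split the values in a row into different cells; in `pd.read_csv()`, `delimiter` is just another name for `sep`. What marks the end of a row is the line terminator (usually a newline).

If the problem is the encoding rather than the separator, the easiest thing to do is often to use the `encoding` option when importing a csv. Usually, `latin1` encoding works to fix the problem.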

In [None]:
#encoding option in python
pd.read_csv('wjp.csv',sep = ',', encoding = 'latin1')
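To see what an encoding problem looks like, here is a minimal sketch that fakes a latin1-encoded file in memory (the contents are made up) and reads it both ways.

```python
import io

import pandas as pd

# A tiny made-up CSV saved with latin1 encoding (the é is a single byte, 0xE9)
raw_bytes = 'country,score\nPérou,0.5\n'.encode('latin1')

# Reading with the default UTF-8 fails on that byte...
try:
    pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError as err:
    print('UTF-8 failed:', err)

# ...while naming the right encoding works
fixed = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin1')
print(fixed)
```

If `latin1` reads the file but accented characters come out garbled, the file probably uses a different encoding and you will need to try another one.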

In [None]:
cd '/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/Data_Analysis_Python_R/Lecture_6/DS/'

In [None]:
# Sometimes you may have trouble importing it and you have no idea why. 
# A first step is to check what the beginning contents look like:
!head 'wjp.csv'

#Looking at the first few lines can indicate how it is separated and where/what are the headers

In [None]:
pd.read_csv('wjp.csv',sep = ',')

#the optional argument sep will let you pick the particular separator for your data

In [None]:
pd.read_csv('wjp.csv',sep = ',', header = 0)
# You may also have to tell pandas what row is the header [remember index 0]

In [None]:
# so let's take a look at a weird example
pd.read_csv('E8081RQI.TXT')
# could also import with pd.read_table('E8081RQI.TXT')
#it's not separated by commas

In [None]:
!head 'E8081RQI.TXT'

In [None]:
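pd.read_csv('E8081RQI.TXT', sep = r'\s+')

# \t = tab
# \s = any whitespace character (space, tab, ...)
# \s+ = one or more whitespace characters
# the r prefix makes it a raw string, so Python leaves the backslash alone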
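If you want to try the whitespace separator without the course file, here is a small sketch on made-up text held in memory. The column names passed to `names` are hypothetical.

```python
import io

import pandas as pd

# Made-up text where columns are separated by a varying number of spaces
text = 'AAA  1980   10.5\nBBB 1981 7.25\n'

# r'\s+' is a regular expression: one or more whitespace characters.
# The r prefix keeps Python from treating the backslash as an escape.
cols = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                   names=['code', 'year', 'value'])
print(cols)
```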

In [None]:
pd.read_csv?

In [None]:
#we can choose which columns we want to include and import without a header
test_data = pd.read_csv('E8081RQI.TXT', sep = r'\s+', usecols=range(0,5), header=None)

#We can rename columns
test_data.columns = ['dataset','variable','population','GDP','Income']
test_data

In [None]:
# we can also drop every row in the file that has a missing value
pd.read_csv('E8081RQI.TXT', sep = r'\s+').dropna()

In [None]:
# careful - it will drop entire rows that have one na
test_data.dropna()

In [None]:
pop_data = test_data.dropna()
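Since `dropna()` can throw away more than you expect, here is a short sketch of some gentler options, using a made-up dataframe. `subset` and `how` are real `dropna` arguments; the data is invented.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values scattered around
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdp': [1.0, np.nan, 3.0],
    'income': [np.nan, np.nan, 9.0],
})

print(toy.isna().sum())              # count missing values per column before deciding
print(toy.dropna())                  # drops any row with at least one missing value
print(toy.dropna(subset=['gdp']))    # only looks at the gdp column
print(toy.dropna(how='all', subset=['gdp', 'income']))  # drops a row only if both are missing
```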

In [None]:
#or we can replace missing values with whatever we want
pd.read_csv('E8081RQI.TXT', sep = r'\s+').fillna(0)

In [None]:
# test_data = pd.read_csv('E8081RQI.TXT', sep=r'\s+')
# show only the rows that have at least one missing value
test_data[test_data.isnull().any(axis=1)]

In [None]:
pd.read_csv?

Sometimes you may get a mysterious 'Unnamed: 0' column. Often this is an index that pandas saved as the first column when the file was exported.
You can do away with it by setting this column as the index:
 
`pd.read_csv(filename, index_col = 0)`
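Here is a minimal sketch of where 'Unnamed: 0' comes from, using a made-up dataframe and an in-memory string instead of a real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B'], 'gini': [0.31, 0.45]})

# to_csv writes the index as a first column with no name by default
csv_text = toy.to_csv()

# Reading it back turns that nameless column into 'Unnamed: 0'
print(pd.read_csv(io.StringIO(csv_text)).columns)

# index_col=0 tells pandas the first column is the index, not data
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns)
```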

Also, you can import from a clipboard by copy and pasting (but there can be errors, so be careful)

`pd.read_clipboard()`

Or from a PDF (this uses the third-party `tabula-py` package, which also needs Java installed)

``` python
from tabula import read_pdf
df = read_pdf('test.pdf', pages='all')
```

## Merging

Often you'll want to combine datasets. Typically, you will pool together a variety of datasets into one dataframe.

Currently we have two dataframes: wjp and ineq

We will merge the two. There are different ways that you can merge.

Other ways of "adding together" dataframes include `DataFrame.append()` (hey you know that! But note it was removed in pandas 2.0, so use `pd.concat()` in new code) and `pd.concat()`

They can be useful in different scenarios.

By far, though, `pd.merge()` is your BFF. 

In [None]:
wjp.columns

In [None]:
ineq.columns

In [None]:
# First, I need to take a look at the column names and see what I want to merge by. In this case it is country and year.
pd.merge(wjp, ineq, left_on=['Country','year'],right_on=['country','year'], how='left')

OK! What happened? What is this wizardry?

The first two arguments are the dataframes you want to merge. Only two can be merged at a time.

- `left_on` is the key (or list of keys) to match on in the left dataframe
- `right_on` is the key (or list of keys) to match on in the right dataframe
- pandas merge will look for EXACT matches between the keys

How it matches depends on the argument `how`.

In this example (`how='left'`), every row of the left dataframe is kept. Rows from the right are attached only where their key also appears on the left, and left rows without a match get missing values in the right-hand columns.

By default (if you didn't put `how`), merge does an 'inner' join; the keys in the result are the intersection. In general, don't ever rely on the default. You'll lose stuff - be aware of how you merge.

There are four ways to join
<img src ="merge_joins.png">

Remember union vs. intersection
<img src ="uion_intersection.png">

From this [great website](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) to check out


Which we will do right now.


<img src = "merge_options.png">
From P4DS
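To see the four joins side by side, here is a small sketch with two made-up dataframes whose keys only partly overlap. `indicator=True` is a real `merge` option that adds a `_merge` column saying where each row came from; the data is invented.

```python
import pandas as pd

# Two made-up dataframes: A is only on the left, D is only on the right
left = pd.DataFrame({'Country': ['A', 'B', 'C'], 'year': [2020, 2020, 2020],
                     'factor1': [0.5, 0.6, 0.7]})
right = pd.DataFrame({'country': ['B', 'C', 'D'], 'year': [2020, 2020, 2020],
                      'gini': [0.45, 0.28, 0.33]})

sizes = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, left_on=['Country', 'year'],
                      right_on=['country', 'year'], how=how, indicator=True)
    sizes[how] = len(merged)  # number of rows each join keeps
    print(how)
    print(merged[['Country', 'country', 'year', '_merge']])

print(sizes)
```

Checking the row count and the `_merge` column after every merge is a cheap way to catch keys that did not match.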

### Pandas + Stuff

Everything you learned so far can be applied to your dataframe. EVERYTHING.

Which means, this is the part of the class where you will go in breakout groups and be totally lost for like 5-10 minutes and then figure out how to do it. 

First, some hints.

In [None]:
wjp_ineq = pd.merge(wjp, ineq, left_on=['Country','year'],right_on=['country','year'],how='left')

In [None]:
wjp_ineq.columns

In [None]:
#filter 
data = wjp_ineq[['Country', 'Region','Income Group', 'year', 'isocode',
                 'factor1', 'factor2','factor3','factor4', 'factor5','factor6',
                 'factor7','factor8','gini', 'population']]

#this filters in the same way that we have seen filtering in the past

# when I import and merge, I usually leave the original variable the same 
# so that I can always reference it if I make a mistake in merging or something else

#The double brackets mean, keep everything that is within the identified columns. You choose the columns.

In [None]:
data

In [None]:
# I can list out the unique values of any column
wjp_ineq['Country'].unique()

In [None]:
wjp_ineq['Country'].nunique()
#or count them

In [None]:
#And filter
wjp_ineq[wjp_ineq['Country']=='Uruguay']

#filtering is the same way I have been showing you all along - boolean searches
# You can filter by values as well or anything that I taught you before 
# (just be sure it's the same type)

In [None]:
wjp_ineq[wjp_ineq['gini']>.3]
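Filters can also be combined. Here is a minimal sketch on a made-up dataframe (the region names and values are invented, not taken from `wjp_ineq`).

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Region': ['North', 'South', 'South', 'East'],
    'gini': [0.25, 0.45, 0.31, 0.38],
})

# Each condition goes in its own parentheses; & means "and", | means "or"
south_unequal = toy[(toy['Region'] == 'South') & (toy['gini'] > .3)]

# isin checks membership in a list of values
two_regions = toy[toy['Region'].isin(['North', 'East'])]

# ~ flips a condition ("not")
not_south = toy[~(toy['Region'] == 'South')]

print(south_unequal, two_regions, not_south, sep='\n\n')
```

The parentheses matter: without them Python evaluates `&` before `==` and `>`, and you get an error.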

In [None]:
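# or any of the numpy-style functions
# (note: pandas .std() is the sample standard deviation, ddof=1, while np.std() defaults to ddof=0)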
wjp_ineq['gini'].std()

In [None]:
# We can get fancy in our filtering and cleaning data

# list(data) gives the column names; keep the ones that start with 'factor'
# and store those names in factor_frame
factor_frame = [x for x in list(data) if x.startswith('factor')]
# here's another way to do (nearly) the same thing: keep names that contain 'factor' anywhere
factor_frame = [x for x in list(data) if 'factor' in x]

# Here's a subset of the data using the list we created with our list comprehension
# We want to include a few other columns besides the factors, so we use extend to add items to the list
factor_frame.extend(['Country', 'Income Group', 'Region', 'gini', 'population'])

# .copy() makes factor_frame its own dataframe, so adding columns later won't trigger a SettingWithCopyWarning
factor_frame = data[factor_frame].copy()

In [None]:
#How many countries? (temporarily store this value for future use)
nc = data['isocode'].nunique()
#How many observations? 
# print a label together with the value you just calculated
print('Number of Observations in ds:', len(data))
#How many years? 
print('Number of years in ds:', data['year'].nunique())
#drop a column that is unneeded (axis=1 means columns; pandas 2.0+ requires it as a keyword)
c = data.drop(['year'], axis=1)

In [None]:
#dropping rows instead of columns
#setting the index to Region so that I can drop all rows whose index is 'Eastern Europe & Central Asia'
c = c.set_index('Region')
c.drop('Eastern Europe & Central Asia', axis=0)

In [None]:
# I can also reset indexes
c = c.reset_index()

In [None]:
# create a variable that adds up all of the factors
factor_frame['total'] = (factor_frame['factor1'] + factor_frame['factor2']
                         + factor_frame['factor3'] + factor_frame['factor4']
                         + factor_frame['factor5'] + factor_frame['factor6']
                         + factor_frame['factor7'] + factor_frame['factor8'])
# create a new variable holding the mean of total (the same single number in every row)
factor_frame['factor_avg'] = factor_frame['total'].mean()
factor_frame['total'] = factor_frame['total'].astype(float)  # store the column total as a float
# factor_frame = factor_frame.drop(['total'], axis=1)  # drop the column total
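Adding columns one by one works, but there is a shorter route. The sketch below, on made-up numbers with only two factor columns, uses `sum(axis=1)` and `mean(axis=1)` to work across columns within each row. One caveat that may or may not matter for your data: these skip missing values, while `+` returns a missing value if any piece is missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'factor1': [0.5, 0.6],
    'factor2': [0.75, np.nan],
    'Country': ['A', 'B'],
})

factor_cols = [c for c in toy.columns if c.startswith('factor')]

# Row-wise total and mean across the factor columns (axis=1 means "across columns")
toy['total'] = toy[factor_cols].sum(axis=1)
toy['row_avg'] = toy[factor_cols].mean(axis=1)

# Caution: sum/mean skip missing values by default, while + gives NaN
toy['total_plus'] = toy['factor1'] + toy['factor2']
print(toy)
```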

## Exporting

We can easily export dataframes at any time with:

`df.to_csv('filename.csv')`


I almost always set the optional `index` argument to `False`. Typically, I don't need the index to travel with the data.

`dataframe.to_csv('filename.csv', index = False)`

In [None]:
cd '/Users/mkaltenberg/Documents/Data Analysis Python R Lectures/'

In [None]:
pwd

In [None]:
# this will export to the file location that you are currently in.
# So, be careful - know where you are in your directory.
factor_frame.to_csv('factors_frame_avg.csv',index=False)
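If you would rather not depend on which directory you happen to be in, one option is to export to a full path. A minimal sketch, assuming a throwaway folder and a made-up file name (`toy_export.csv`):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({'Country': ['A', 'B'], 'total': [1.25, 0.6]})

# A throwaway folder stands in for your own output folder
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / 'toy_export.csv'

# Writing to a full path means the result does not depend on the current directory
toy.to_csv(out_file, index=False)

# Read it back to check that the export worked
check = pd.read_csv(out_file)
print(out_file.exists())
print(check)
```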

## I hope all of that time spent on python functions is coming together for some magic.

### Now it's your turn!

<img src ='https://media.giphy.com/media/citBl9yPwnUOs/giphy.gif' width = 300>

More useful tips from pandas at this [website](https://www.dataschool.io/python-pandas-tips-and-tricks/#readingfiles)

## Breakout Groups

In [None]:
# practice exercise
# get to this point:

wjp = pd.read_csv('wjp.csv') 
ineq = pd.read_csv('ineq.csv') 
wjp_ineq = pd.merge(wjp, ineq, left_on=['Country','year'],right_on=['country','year'],how='left')

1. Filter the dataset to show only data from the region 'Sub-Saharan Africa'
2. How many countries are in the region?
3. Calculate the average gini of the region.
4. What's the maximum population in the region? What's the country's name?
5. Can you write a for loop that does these calculations for all of the regions?
6. Export the filtered dataset from step 1 (only countries that are from the region)


# Part Two