# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Machine Learning  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 4 - Making Graphs: Exploratory Data Analysis</div>
<div align="center"> Jonathan Holmes, (he/him)</div>

# Today

- This will be our last class entirely dedicated to coding
- In class 2 we learned about some of the basics of Python
- In class 3 we spent time getting familiar with the Pandas library
- Today we will keep using Pandas and other tools such as [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for data visualization
- This class, I provide all the code (no "follow-along" version), but I still encourage you to play around with the code.

# Exploratory Data Analysis

- What is exploratory data analysis?
- This is usually the last step before the actual analysis (e.g. running regressions)
- This is an opportunity to explore variables: 
    - alone or as they relate to one another
    - using statistics or data visualization

# Dataset
- Today we will use a dataset from the [Village Dynamics Studies in South Asia](https://vdsa.icrisat.org/)
- This dataset contains information on agricultural yields and prices for several crops at the district level in India for a number of years
- This is what's called panel data, or a longitudinal survey: the same districts are observed repeatedly over time
- We will use a modified version of this dataset
- Let's explore this data together!
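
To make the panel structure concrete, here is a minimal sketch on a made-up table. The column names mirror the ones used in this notebook, but the values are hypothetical and are not taken from the VDSA data.

```python
import pandas as pd

# Hypothetical toy panel: 2 districts observed over 3 years
panel = pd.DataFrame({
    "distcode": [1, 1, 1, 2, 2, 2],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "yield_rice": [1.2, 1.3, 1.1, 0.9, 1.0, 1.4],
})

# In a panel, each (unit, period) pair should appear at most once
is_unique = not panel.duplicated(subset=["distcode", "year"]).any()

# A balanced panel has the same number of periods for every unit
obs_per_district = panel.groupby("distcode")["year"].nunique()
is_balanced = obs_per_district.nunique() == 1

print(is_unique, is_balanced)
```

Real panels are often unbalanced (some districts are missing in some years), so a check like this is a useful first look.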

In [None]:
# import our packages and use aliases
import numpy as np
import pandas as pd

# Load Dataset
- The dataset is named "vdsa"
- It is a Stata file (ending with _.dta_ extension)
- Store the path to the folder containing the dataset in class4Folder
- Use the Pandas function read_stata() to open the dataset
- Assign it to df

In [None]:
# Assign path to the folder containing the dataset
class4Folder="~/Dropbox/_teaching/ECO4199/2023/Data-Science-for-Social-Scientists/Class 04 - Making Graphs/"
# read in the data using the read_stata() function and assign it to df
df= pd.read_stata(class4Folder+"vdsa.dta")

What should I do now?

In [None]:
# show number of observations and columns
display(df.shape)
#show information on the number of missing values and data type
#display(df.info())
# describe the dataframe
# display(df.describe().T)
# read head of dataframe
display(df.head())

# Data cleaning
- Before exploring the data further we need to do some data cleaning
- info() indicates that none of our columns has missing values
    - Yet there are clearly empty cells when calling head() on df
    - Do negative values for yield or area mean something?
- Also, it seems that prices are stored as strings when they should clearly be numerical

# Dealing with NaNs
- You may remember that last week we replaced missing values (NaN) with zeros
     - this is because we had reasons to believe that missing values meant no actual test
- In this dataset the situation is probably different. Yields and areas could be missing because:
    - the specific crop wasn't grown in the district in that year
    - the yields and areas existed but weren't recorded
- It turns out that in this dataset, missing values were recorded as -1 (this is bad practice)
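
A small sketch of why a numeric code for "missing" is risky: summary statistics silently treat it as data. The numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical yields; -1 is the code for "missing"
yields = pd.Series([2.0, -1.0, 4.0, -1.0])

# Treating -1 as a real number drags the mean down
naive_mean = yields.mean()

# Converting the code to NaN lets pandas skip those rows
clean = yields.replace(-1, np.nan)
clean_mean = clean.mean()

print(naive_mean, clean_mean, clean.isna().sum())
```

The naive mean averages four numbers, while the clean mean averages only the two observed yields.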

## replace()
- We can use the [replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method to replace -1 with NaN
- Because none of the variables could plausibly be negative, we can apply replace() to the entire dataframe rather than to a single series

In [None]:
# use replace method on the entire dataframe and reassign to df
df=df.replace(-1,np.nan)
# show head()
df.head()

# replace() & to_numeric()
- It also seems that the values in the price columns are preceded by the currency unit "R" (rupee)
- Let's remove it from the values using the replace() method for strings
- When applied to an object column, the replace method should be preceded by [str](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html)
- Then we can change the column type to numeric using the [to_numeric()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) function
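
The two steps can be sketched on a toy series first. The price strings below are hypothetical, and the `regex=False` and `errors="coerce"` arguments are optional extras that the notebook code does not use.

```python
import pandas as pd

# Hypothetical price strings in the "R <number>" format, with one missing value
prices = pd.Series(["R 10.5", "R 7", None])

# .str gives access to string methods; regex=False treats "R " as literal text
stripped = prices.str.replace("R ", "", regex=False)

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(stripped, errors="coerce")

print(numeric)
```

Using `errors="coerce"` is a design choice: it keeps the conversion from failing on odd entries, at the cost of hiding them as NaN, so it is worth counting the NaNs before and after.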

In [None]:
# remove "R" from the price_maize series
display(df['price_maize'].str.replace("R ",""))
# combine it with to_numeric()
pd.to_numeric(df['price_maize'].str.replace("R ",""))

# In-Class Exercise
For each column starting with price:
- remove "R" from the values
- cast the data type to numeric
- replace the existing column with its numeric representation

In [None]:
# create a list of columns starting with price_ using list comprehension
cols = [col for col in df if col.startswith('price')]
# print the head of columns starting with price
display(df[cols].head())
# loop over the column list
for col in cols:
    # replace the values 
    df[col]= pd.to_numeric(df[col].str.replace("R ",""))
    
# print head() of the data frame
df.head()

df.to_pickle(class4Folder+"vdsa_cleaned.pkl")

## Loading cleaned data

- After doing data cleaning, we can save a version of the data. 
- The native format for saving and loading a pandas dataset is called "pkl" or "pickle"
- The name comes from "pickling" food (allowing you to store for later)
- You can actually pickle almost any Python object
- The pandas functions are "to_pickle()" and "read_pickle()"
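
A minimal round-trip sketch, using a made-up dataframe and a temporary folder so that nothing on disk is overwritten:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data with a numeric and a text column
toy = pd.DataFrame({"year": [2000, 2001], "statename": ["Bihar", "Punjab"]})

# Save to a temporary folder, then load the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Values and column types survive the round trip
same = restored.equals(toy)
print(same)
```

One caveat: loading a pickle can execute arbitrary code, so only open pickle files that come from a source you trust.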

In [None]:
df = pd.read_pickle(class4Folder+"vdsa_cleaned.pkl")

Now that prices are numeric we can call the describe() method again

In [None]:
# call describe() on df
df.describe().round(1).transpose()

# groupby()
- It may be good to know about the number of unique districts
- One way to do this would be to combine groupby() with first() and then check the length
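
Here is a sketch of this counting approach next to two alternatives, on a hypothetical toy table. Note that district code 10 appears in both states, which is why we group on the state and the district together.

```python
import pandas as pd

# Hypothetical panel: 3 districts spread over 2 states, 2 years each
toy = pd.DataFrame({
    "statecode": [1, 1, 1, 1, 2, 2],
    "distcode":  [10, 10, 11, 11, 10, 10],
    "year":      [2000, 2001, 2000, 2001, 2000, 2001],
})

keys = ["statecode", "distcode"]

# Approach from the text: one row per group, then count the rows
n_first = len(toy.groupby(keys).first())

# Alternative 1: ask the groupby object for its number of groups
n_groups = toy.groupby(keys).ngroups

# Alternative 2: count the distinct key combinations
n_unique = len(toy[keys].drop_duplicates())

print(n_first, n_groups, n_unique)
```

All three give the same count here; `ngroups` avoids building an intermediate dataframe.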

In [None]:
# keep one row per (state, district) pair, then count the rows
display(df.groupby(['statecode','distcode']).first())
num_district=len(df.groupby(['statecode','distcode']).first())
print(f"There are {num_district} districts in this dataset")

# Data visualization
- Some features of your data beyond the min, max, average and SD are better understood when visualized
- We will now talk about data visualization
- Pandas integrates a visualization package named [Matplotlib](https://matplotlib.org/)
- Another useful package is [Seaborn](https://seaborn.pydata.org/)

In [None]:
# import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

## plot()
- There are actually ways to plot directly from Pandas
- This is the [plot()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) method

In [None]:
# Call plot() on the data frame
df.plot()

Clearly this is not readable. Let's try the seaborn [pairplot()](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function.

In [None]:
# use the pairplot function on the price columns
cols = [col for col in df if col.startswith('price')]
sns.pairplot(df[cols])

# Matplotlib's basics
- The pairplot is aesthetically pleasing but things could be tuned
- You may want to replace the axes titles, the ticks, the colors etc.
- Let's see how to do this using Matplotlib
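
As a rough preview, the sketch below shows the basic pattern: create a figure and an axes, draw on the axes, then customize through methods on the axes. The data points are made up, and the `Agg` backend line is only an assumption for running the code outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# One figure (the canvas) holding one axes (the plotting area)
fig, ax = plt.subplots(figsize=(6, 4))

# Hypothetical data points
ax.plot([2000, 2001, 2002], [1.0, 1.5, 1.2], marker="o")

# Customization happens through methods called on ax
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.set_title("A minimal figure")

n_lines = len(ax.lines)
plt.close(fig)  # free the memory held by the figure
```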

In [None]:
# Initialize figure and axes
fig, ax = plt.subplots()

#plt.show()

## agg()
- Let's plot some data for the whole of India
- Create a new dataframe named india where you group by year and take the sum of yields
- To do this we will combine the [agg()](https://pandas.pydata.org/pandas-docs/version/0.23.2/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) method with groupby()
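
The pattern inside agg() is `new_column=(source_column, function)`. A minimal sketch on hypothetical numbers (the `wheat_mean` column is only there to show that different functions can be mixed):

```python
import pandas as pd

# Hypothetical district-level yields for two years
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "yield_rice": [1.0, 2.0, 3.0, 4.0],
    "yield_wheat": [0.5, 0.5, 1.0, 2.0],
})

# One output row per year; each keyword names a new column
by_year = (
    toy.groupby("year")
       .agg(rice=("yield_rice", "sum"), wheat_mean=("yield_wheat", "mean"))
       .reset_index()
)

print(by_year)
```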

In [None]:
# yearly data on rice and wheat yields
india= df.groupby(['year']).agg(rice=('yield_rice','sum'),wheat=('yield_wheat','sum'))
india.reset_index(inplace=True)
india

- Let's now plot the evolution of rice over the years
- You can do this by calling plot() on ax

In [None]:
# Plot rice against year
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])
plt.show()

In [None]:
# Plot rice and wheat against year
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])
ax.plot(india["year"], india["wheat"])
plt.show()

Let's improve on the presentation and use [markers](https://matplotlib.org/3.1.1/api/markers_api.html) for each data point, change the [line style](https://matplotlib.org/2.0.2/api/lines_api.html#matplotlib.lines.Line2D.set_linestyle) and the [color](https://matplotlib.org/3.1.0/gallery/color/named_colors.html)

In [None]:
# Plot rice and wheat against year
# add a marker for each year
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen")
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange")
plt.show()

### Axis labels and title
- It is always a good idea to label your axes and give your plot a title
- You can call set_xlabel() and set_ylabel() on ax to label the axes
- You can call set_title() on ax to add a title

In [None]:
# Plot and add labels
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen")
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange")

#label axes
ax.set_xlabel("Year")
ax.set_ylabel("Total Yield")

# title
ax.set_title("Rice and Wheat Production in India")


plt.show()

## legend 
- At this stage, it is impossible to know which line represents which crop
- You can specify the labels with the label argument in plot()
- You then need to specify where in the plot you want the legend to be positioned using [legend()](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html) on plt

In [None]:
# Plot and add labels
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen",label='Rice')
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange",label='Wheat')

#label axes
ax.set_xlabel("Year")
ax.set_ylabel("Total Yield")

# title
ax.set_title("Rice and Wheat Production in India")
# place the legend
plt.legend(loc="upper left")

plt.show()

### ticks

- The ticks on the x-axis should clearly be integers; let's make one tick per year

In [None]:
# Plot and add labels
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen",label='Rice')
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange",label='Wheat')

#label axes
ax.set_xlabel("Year")
ax.set_ylabel("Total Yield")

# title
ax.set_title("Rice and Wheat Production in India")
# place the legend
plt.legend(loc="upper left")

# Redefine the ticks
plt.xticks(np.arange(india['year'].min(), india['year'].max()+1, 1))

plt.show()





### ticks continued
- This is clearly too crowded
- We could rotate the ticks
- Or take larger steps
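
The tick positions come from np.arange(start, stop, step). A quick sketch with hypothetical first and last years shows how the step thins out the ticks:

```python
import numpy as np

# Hypothetical first and last year of the data
first_year, last_year = 1990, 2000

# np.arange excludes the stop value, hence the +1
every_year = np.arange(first_year, last_year + 1, 1)
every_third = np.arange(first_year, last_year + 1, 3)

print(len(every_year), every_third.tolist())
```

With a step of 3, the last tick may fall before the last year of the data (here 1999 rather than 2000).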

In [None]:
# Plot and add labels
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen",label='Rice')
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange",label='Wheat')

#label axes
ax.set_xlabel("Year")
ax.set_ylabel("Total Yield")

# title
ax.set_title("Rice and Wheat Production in India")
# place the legend
plt.legend(loc="upper left")

# Redefine the ticks and rotate
plt.xticks(np.arange(india['year'].min(), india['year'].max()+1, 1), rotation='vertical')

plt.show()


In [None]:
# Plot and add labels
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"], marker='o', linestyle="--", color="darkgreen",label='Rice')
ax.plot(india["year"], india["wheat"], marker='v', linestyle=":", color="darkorange",label='Wheat')

#label axes
ax.set_xlabel("Year")
ax.set_ylabel("Total Yield")

# title
ax.set_title("Rice and Wheat Production in India")
# place the legend
plt.legend(loc="upper left")

# Redefine the ticks and rotate
plt.xticks(np.arange(india['year'].min(), india['year'].max()+1, 3), rotation=45)

plt.show()


## Plotting more data
- Let's see how the quantity produced changes with prices
- We will create a new dataframe: india_prices
- It will record the average price each year using agg() again

In [None]:
india_prices= df.groupby(['year']).agg(Price=('price_rice','mean'),Pwheat=('price_wheat','mean'),Psorghum=('price_sorghum','mean'))
india_prices.reset_index(inplace=True)
india_prices

## merge()
- Let's use india and india_prices together
- To do so we will use the merge() method
- And save it in india_full
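
By default merge() performs an inner join, which keeps only the keys found in both dataframes. The sketch below uses two hypothetical tables whose years only partly overlap; the `validate`, `how` and `indicator` arguments are optional checks that the notebook code does not use.

```python
import pandas as pd

# Hypothetical yearly tables with partly overlapping years
quantities = pd.DataFrame({"year": [2000, 2001, 2002], "rice": [10.0, 12.0, 11.0]})
prices = pd.DataFrame({"year": [2001, 2002, 2003], "Price": [5.0, 6.0, 7.0]})

# Inner join (the default); validate raises if a year is duplicated on either side
inner = quantities.merge(prices, on="year", validate="one_to_one")

# Outer join keeps every year; indicator=True adds a _merge column showing the source
outer = quantities.merge(prices, on="year", how="outer", indicator=True)

print(inner["year"].tolist())
print(outer[["year", "_merge"]])
```

Comparing the number of rows before and after a merge is a cheap way to notice years that were silently dropped.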

In [None]:
# merge india and india_prices and assign to india_full
india_full=india.merge(india_prices, on='year')
india_full.head()

In [None]:
# initialize a 2x2 plot
fig, axes = plt.subplots(2,2)


In [None]:
fig, axes = plt.subplots(2,2,sharex=True, sharey='row', figsize=(12, 12))
axes[0,0].plot(india_full["year"], india_full["rice"], marker='o', linestyle="--", color="darkgreen",label='Rice')
axes[0,1].plot(india_full["year"], india_full["wheat"], marker='v', linestyle=":", color="darkorange",label='Wheat')
axes[1,0].plot(india_full["year"], india_full["Price"], marker='.', linestyle="--", color="darkgreen",label='Rice')
axes[1,1].plot(india_full["year"], india_full["Pwheat"], marker='.', linestyle=":", color="darkorange",label='Wheat')

axes[1,0].set_xlabel("Year")
axes[1,1].set_xlabel("Year")


axes[1,0].set_xticks(np.arange(india_full['year'].min(), india_full['year'].max()+1, 3))
axes[1,1].set_xticks(np.arange(india_full['year'].min(), india_full['year'].max()+1, 3))


plt.show()

# Scatter plot


In [None]:
fig, axes = plt.subplots(2,1,sharex=False, sharey=True, figsize=(12, 12))
axes[0].scatter(india_full["Price"], india_full["rice"], marker='o', color="darkgreen",label='Rice')
axes[1].scatter(india_full["Pwheat"], india_full["wheat"], marker='v', color="darkorange" ,label='Wheat')



plt.show()

# Distribution of a single variable

- Often you will want to get a sense of the distribution of a single variable
- If you are interested in Bayesian statistics this is a must
- Let's look at the distribution of rice and wheat in our dataset
- A distribution gives you prob$(x = X)$

In [None]:
fig, axes = plt.subplots(2,1,sharex=True, sharey=True, figsize=(12, 12))
axes[0].hist(india['rice'])
axes[1].hist(india['wheat'])
plt.show()

This is not a density function (look at the y-axis). Let's set the argument density to True.
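
What density=True changes can be checked numerically. The sketch below uses NumPy's histogram function on simulated data (not the VDSA data), on the assumption that it follows the same density convention as the plot: bar heights are rescaled so that the total area of the bars equals 1.

```python
import numpy as np

# Simulated sample, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Counts: the bar heights sum to the number of observations
counts, edges = np.histogram(sample, bins=30)

# Density: the bar heights times the bin widths sum to 1
density, _ = np.histogram(sample, bins=30, density=True)
area = float(np.sum(density * np.diff(edges)))

print(counts.sum(), round(area, 6))
```

Note that a density height can exceed 1 when the bins are narrow; it is the area, not the height, that is a probability.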

In [None]:
fig, axes = plt.subplots(2,1,sharex=True, sharey=True, figsize=(12, 12))
axes[0].hist(india['rice'],bins=30,density=True)
axes[1].hist(india['wheat'],bins=30,density=True)
plt.show()

## Cumulative distribution
- Cumulative distribution gives you prob$(x\leq X)$
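
In a sample, this probability is estimated by the share of observations at or below a given value. A minimal sketch with made-up numbers (`ecdf` is a helper name introduced here, not a library function):

```python
import numpy as np

# Hypothetical sample
values = np.array([3.0, 1.0, 2.0, 2.0, 5.0])

def ecdf(sample, x):
    """Share of observations in sample that are less than or equal to x."""
    return float(np.mean(np.asarray(sample) <= x))

# Three of the five observations are <= 2
print(ecdf(values, 2.0))
```

The function starts at 0 below the smallest observation and reaches 1 at the largest one, which is the shape the cumulative histogram approximates.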

In [None]:
fig, axes = plt.subplots(2,1,sharex=True, sharey=True, figsize=(12, 12))
axes[0].hist(india['rice'],bins=30,density=True,cumulative=True)
axes[1].hist(india['wheat'],bins=30,density=True,cumulative=True)
plt.show()

## Making sense of distributions
- These distributions give us a sense of how spread out the data are
- In our original data there are two sources of variation:
    - Cross-sectional: some districts are more productive than others in any given year
    - Intertemporal: some years are more productive than others on average
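
The two sources of variation can be separated with groupby(). This is only a sketch on a hypothetical two-district, two-year table; the standard deviation is one possible measure of spread among several.

```python
import pandas as pd

# Hypothetical yields: 2 districts observed in 2 years
toy = pd.DataFrame({
    "distcode": [1, 2, 1, 2],
    "year": [2000, 2000, 2001, 2001],
    "yield_rice": [1.0, 3.0, 2.0, 6.0],
})

# Cross-sectional: spread across districts within each year
within_year_sd = toy.groupby("year")["yield_rice"].std()

# Intertemporal: how the yearly average moves over time
yearly_mean = toy.groupby("year")["yield_rice"].mean()

print(within_year_sd)
print(yearly_mean)
```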
    
Let's now look at rice in different years of the 2000s

In [None]:
# Subset df to the 2000s (2000 to 2009)
india_2000=df.query("2000<=year<2010")
# store the unique values for years and the number of years in the dataset
years=india_2000['year'].unique()
num_years=len(years)
print(f"There are {num_years} years in the dataset.")

- We can now plot for each year separately
- The variation will thus capture differences across districts within each year

In [None]:
fig, axes = plt.subplots(num_years,1,sharex=True, sharey=False, figsize=(12, 20))

for i,y in enumerate(np.sort(years)):
    # drop missing yields before binning
    axes[i].hist(india_2000.loc[india_2000.year==y, 'yield_rice'].dropna(), bins=30, density=True)
    axes[i].set_xlabel(f"{y}")

plt.show()

Let's now look at how the distribution varies by State
- We will first need to aggregate by state and year
- We will then plot a distribution for each State
- This will capture the variation across years within each State separately

In [None]:
# Create new dataframe india_state grouped by year and state with the sum of yield rice and the first value of statename
india_state=df.query("2000<=year<2010").groupby(['year','statecode']).agg(rice=('yield_rice','sum'), State=('statename','first')).reset_index()
display(india_state.head())

# store the unique values for States and the number of States in the dataset
states=india_state['State'].unique()
num_states=len(states)
print(f"There are {num_states} states in the dataset.")

## Nicer plot
- For better looking plots you can use Seaborn
- Seaborn is built on Matplotlib, so Matplotlib's customization tools still work on Seaborn plots
- Let's plot the kernel density of rice for the entire dataset
- Kernel density is a way to plot your data in a continuous-looking way
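
To see what a kernel density does, here is a hand-rolled sketch with a Gaussian kernel: place a small bell curve on each observation and average them. The function name, the sample and the fixed bandwidth of 0.5 are all assumptions made for illustration; Seaborn picks the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j]: distance from grid point i to observation j, in bandwidth units
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

# Hypothetical sample and evaluation grid
sample = np.array([1.0, 1.5, 2.0, 4.0])
grid = np.linspace(-4.0, 9.0, 1301)
density = gaussian_kde_sketch(sample, grid, bandwidth=0.5)

# The area under the curve should be close to 1 (simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one gives a bumpier curve that follows individual observations.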

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 12))

ax=sns.kdeplot(data=df["yield_rice"], cut=0, color="k")
plt.show()

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 12))
# plot the kdensity for all years and set the color to black
ax=sns.kdeplot(data=df["yield_rice"], cut=0, color="k")
# add a kdensity for each year
for y in np.sort(df.year.unique()):
    ax=sns.kdeplot(data=df.loc[df.year==y,'yield_rice'],alpha=.3, cut=0, legend=False)

plt.show()