# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 3 - Exploratory Data Analysis</div>
<div align="center"> Fabien Forge, (he/him)</div>

# Today

- This will be our last class entirely dedicated to coding
- In class 1 we learned about some of the basics of Python
- In class 2 we spent time getting familiar with the Pandas library
- Today we will keep using Pandas and other tools such as [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for data visualization

# Exploratory Data Analysis

- What is exploratory data analysis?
- This is usually the last step before the actual analysis (e.g. running regressions)
- This is an opportunity to explore variables: 
    - alone or as they relate to one another
    - using statistics or data visualization

# Dataset
- Today we will use a dataset from the [Village Dynamics Studies in South Asia](http://vdsa.icrisat.ac.in/)
- This dataset contains information on agricultural yields and prices for several crops at the district level in India for a number of years
- This is what's called panel data, or a longitudinal survey
- We will use a modified version of this dataset
- Let's explore this data together!

In [None]:
# import our packages and use aliases


# In class exercise
- The dataset is named "vdsa"
- It is a Stata file (ending with _.dta_ extension)
- Store the path of the folder containing the dataset in a variable named class3Folder
- Use the Pandas function read_stata() to open the dataset
- Assign it to df
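
A minimal sketch of these steps, assuming the file is called `vdsa.dta`. So that the snippet runs anywhere, it first writes a tiny stand-in file to a temporary folder; the column names and values are made up for illustration. In class, `class3Folder` would instead hold the path to wherever you saved the real dataset.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the folder holding the dataset (in class: your own path)
class3Folder = tempfile.mkdtemp()
path = os.path.join(class3Folder, "vdsa.dta")

# Write a tiny made-up Stata file so there is something to read back
toy = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "area_rice": [10.0, -1.0, 12.5],
    "yield_rice": [1.5, -1.0, 2.0],
})
toy.to_stata(path, write_index=False)

# Read the Stata file into a dataframe and take a first look
df = pd.read_stata(path)
print(df.shape)
print(df.head())
```

Building the path with `os.path.join()` avoids having to worry about forward or backward slashes on different operating systems.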

In [None]:
# Assign path to the folder containing the dataset
class3Folder=
# read in the data using the read_stata() function and assign it to df
df= 

What should I do now?

In [None]:
# show number of observations and columns
#show information on the number of missing values and data type
# describe the dataframe
# read head of dataframe


# Data cleaning
- Before exploring the data further we need to do some data cleaning
- Calling info() suggests that none of our columns has missing values
    - Clearly there are empty cells when calling head() on df
    - Do negative values for yield or area mean something?
- Also, it seems like prices are stored as strings when clearly they should be numerical

# Dealing with NaNs
- You may remember that last week we replaced missing values (NaN) by zeros
     - this is because we had reasons to believe that missing values meant no actual test
- In this dataset this is probably different:
    - Yields and areas could be missing because the specific crop wasn't grown in the district in that year
    - Yields and areas existed but weren't recorded
- It turns out that in this dataset, missing values were recorded as -1 (this is bad practice)

## replace()
- We can use the [replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method to replace -1 by nan
- Because none of the variables could plausibly be negative, we can apply replace() to the entire dataframe rather than to a single series
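
A small sketch of the idea on made-up data (the column names are assumptions, not necessarily those of the real file). Before the replacement Pandas sees no missing values at all, because -1 is a perfectly valid number.

```python
import numpy as np
import pandas as pd

# Made-up data in which -1 stands for "missing"
df = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "yield_rice": [1.5, -1.0, 2.0],
    "area_rice": [-1.0, 30.0, 28.0],
})

# Count of missing values before the replacement
print(df.isna().sum())

# Replace the -1 placeholder by NaN everywhere and reassign to df
df = df.replace(-1, np.nan)

# Count of missing values after the replacement
print(df.isna().sum())
print(df.head())
```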

In [None]:
# use replace method on the entire dataframe and reassign to df
# show head()


# replace() & to_numeric()
- It also seems that the values in the price columns are preceded by the currency unit "R" (rupee)
- Let's remove it from the values using the replace() method
- When applied to an object column, the replace method should be preceded by the [str](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html) accessor
- Then we can change the column type to numeric using [to_numeric()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) function
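
Here is a short sketch on a made-up `price_maize` series. The values are invented; only the "R" prefix mirrors what we see in the dataset.

```python
import pandas as pd

# Made-up prices stored as strings with an "R" prefix
df = pd.DataFrame({"price_maize": ["R120.5", "R98", "R101.25"]})

# Strip the prefix through the .str accessor (plain text match, no regex)
no_prefix = df["price_maize"].str.replace("R", "", regex=False)

# Convert the cleaned strings to numbers and overwrite the column
df["price_maize"] = pd.to_numeric(no_prefix)
print(df.dtypes)
print(df.head())
```

If some cells hold text that cannot be parsed as a number, passing `errors="coerce"` to `to_numeric()` turns them into NaN instead of raising an error.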

In [None]:
# remove "R" from the price_maize series
# combine it with to_numeric()


# In class exercise
For each column starting with price:
- remove R in all the price columns
- cast the data type to numeric
- replace the existing column with its numeric representation
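
One possible way to approach this, sketched on made-up data. The column names `price_rice` and `price_wheat` are assumptions; the real dataset may have more price columns.

```python
import pandas as pd

# Made-up data with two price columns stored as strings
df = pd.DataFrame({
    "year": [2000, 2001],
    "price_rice": ["R10", "R12.5"],
    "price_wheat": ["R8", "R9"],
})

# List comprehension: keep the columns whose name starts with "price_"
price_cols = [col for col in df.columns if col.startswith("price_")]
print(df[price_cols].head())

# Overwrite each price column with its numeric version
for col in price_cols:
    df[col] = pd.to_numeric(df[col].str.replace("R", "", regex=False))

print(df.head())
```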

In [None]:
# create a list of columns starting with price_ using list comprehension

# print the head of columns starting with price

# loop over the column list

    # replace the values 

    
# print head() of the data frame


Now that prices are numeric we can call the describe() method again

In [None]:
# call describe() on df


# groupby()
- It may be good to know about the number of unique districts
- One way to do this would be to combine groupby() with first() and then check the length
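
A sketch of this approach on made-up data, assuming the district identifier is stored in a column named `distname` (a hypothetical name). The `nunique()` method gives the same count more directly.

```python
import pandas as pd

# Made-up panel: three districts observed in one or two years
df = pd.DataFrame({
    "distname": ["A", "A", "B", "B", "C"],
    "year": [2000, 2001, 2000, 2001, 2000],
})

# groupby() + first() keeps one row per district; len() counts them
n_districts = len(df.groupby("distname").first())
print(n_districts)

# Same count with nunique()
print(df["distname"].nunique())
```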

In [None]:
# Get info on the variables

# Print out the number of districts



# Data visualization
- Some dimensions of your data beyond min, max, average and SD are better understood when visualized
- We will now talk about data visualization
- Pandas integrates a visualization package named [Matplotlib](https://matplotlib.org/)
- Another useful package is [Seaborn](https://seaborn.pydata.org/)

In [None]:
# import matplotlib and seaborn



## plot()
- There are actually ways to plot directly from Pandas
- This is the [plot()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) method
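
A quick sketch on made-up data. Calling `plot()` with no arguments draws one line per numeric column against the index, which gets messy fast; passing `x` and `y` is usually easier to read.

```python
import pandas as pd

# Made-up yearly data
df = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "rice": [1.5, 1.7, 1.6],
    "wheat": [2.0, 2.1, 2.3],
})

# One line per numeric column, plotted against the index
ax = df.plot()

# Pick the x and y columns explicitly
ax_xy = df.plot(x="year", y=["rice", "wheat"])
```

In a notebook the figures show up at the end of the cell. In a script you would need to import `matplotlib.pyplot` and call `plt.show()`.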

In [None]:
# Call plot() on the data frame



Clearly this is not readable.
Let's try the seaborn [pairplot()](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function.

In [None]:
# use the pairplot function on the price columns



# Matplotlib's basics
- The pairplot is aesthetically pleasing but things could be tuned
- You may want to replace the axes titles, the ticks, the colors etc.
- Let's see how to do this using Matplotlib

In [None]:
# Initialize figure and axes

# show the result


## agg()
- Let's plot some data for the whole of India
- Create a new dataframe named india where you group by year and take the sum of yields
- To do this we will combine the [agg()](https://pandas.pydata.org/pandas-docs/version/0.23.2/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) method with groupby()
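
A sketch of the aggregation on made-up data, assuming the yield columns are simply called `rice` and `wheat`. Passing a dictionary to `agg()` lets you choose one function per column.

```python
import pandas as pd

# Made-up district-year data
df = pd.DataFrame({
    "distname": ["A", "B", "A", "B"],
    "year": [2000, 2000, 2001, 2001],
    "rice": [1.0, 2.0, 1.5, 2.5],
    "wheat": [3.0, 4.0, 3.5, 4.5],
})

# Sum rice and wheat over all districts within each year
india = df.groupby("year").agg({"rice": "sum", "wheat": "sum"})

# groupby() moved year to the index; turn it back into a column
india.reset_index(inplace=True)
print(india)
```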

In [None]:
# yearly data on rice and wheat yields

# reset index inplace



- Let's now add the evolution of rice over the years
- You can add it by calling plot() on ax
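
A minimal sketch, using a made-up `india` dataframe in place of the one built from the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals
india = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "rice": [3.0, 4.0, 3.5],
})

# Initialize the figure and its axes, then draw rice against year on ax
fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])
```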

In [None]:
# Plot rice against year


In [None]:
# Plot rice and wheat against year


Let's improve on the presentation and use [markers](https://matplotlib.org/3.1.1/api/markers_api.html) for each data point, change the [line style](https://matplotlib.org/2.0.2/api/lines_api.html#matplotlib.lines.Line2D.set_linestyle) and the [color](https://matplotlib.org/3.1.0/gallery/color/named_colors.html)
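
For instance, on a made-up `india` dataframe, the `marker`, `linestyle` and `color` arguments of `plot()` could be set as follows. The specific choices here are only examples.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals
india = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "rice": [3.0, 4.0, 3.5],
    "wheat": [7.0, 8.0, 7.5],
})

fig, ax = plt.subplots()
# Circles on a dashed green line for rice
ax.plot(india["year"], india["rice"], marker="o", linestyle="--", color="green")
# Squares on a dotted line for wheat
ax.plot(india["year"], india["wheat"], marker="s", linestyle=":", color="goldenrod")
```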

In [None]:
# Plot rice and wheat against year
# add a marker for each year


### Axis labels and title
- It is always a good idea to label your axes and give your plot a title
- You can call set_xlabel() and set_ylabel() on ax to label the axes
- You can call set_title() on ax to add a title
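
A short sketch on made-up data. The label texts are placeholders, so adapt them to the units of the real variables.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals
india = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "rice": [3.0, 4.0, 3.5],
    "wheat": [7.0, 8.0, 7.5],
})

fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])
ax.plot(india["year"], india["wheat"])

# Label both axes and add a title
ax.set_xlabel("Year")
ax.set_ylabel("Yield (sum over districts)")
ax.set_title("Rice and wheat over time")
```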

In [None]:
# Plot and add labels


## legend 
- At this stage, it is impossible to know which line represents which crop
- You can specify the label of each line with the label argument of plot()
- You then need to specify where in the plot you want the legend to be positioned using [legend()](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html) on plt
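
A sketch on made-up data. The position `"upper left"` is just one of the values accepted by the `loc` argument.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals
india = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "rice": [3.0, 4.0, 3.5],
    "wheat": [7.0, 8.0, 7.5],
})

fig, ax = plt.subplots()
# The label argument is the text that will show up in the legend
ax.plot(india["year"], india["rice"], label="Rice")
ax.plot(india["year"], india["wheat"], label="Wheat")

# Draw the legend on the current axes and choose its position
plt.legend(loc="upper left")
```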

In [None]:
# Plot and add labels

#label axes

# title

# place the legend


### ticks

- On the x-axis, ticks should clearly be integers. Let's make one tick per year
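
One way to do this, sketched on made-up data, is to pass the list of years to `set_xticks()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals
india = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "rice": [3.0, 4.0, 3.5, 4.2],
})

fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])

# Put exactly one tick at each year in the data
ax.set_xticks(india["year"].tolist())
```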

In [None]:
# Plot and add labels


#label axes


# title

# place the legend


# Redefine the ticks

# show result



### ticks continued
- This is clearly too crowded
- We could rotate the ticks
- Or take larger steps
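
Both options are sketched here on made-up data covering 1990 to 2010. The 5-year step and the 45-degree angle are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly totals for 1990-2010
years = list(range(1990, 2011))
india = pd.DataFrame({
    "year": years,
    "rice": [100 + 2 * i for i in range(len(years))],
})

fig, ax = plt.subplots()
ax.plot(india["year"], india["rice"])

# Larger steps: one tick every 5 years between the first and last year
first, last = india["year"].min(), india["year"].max()
ax.set_xticks(range(first, last + 1, 5))

# Rotation: tilt the x tick labels by 45 degrees (use 90 for vertical)
ax.tick_params(axis="x", labelrotation=45)
```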

In [None]:
# Plot and add labels


#label axes


# title

# place the legend


# Redefine the ticks and rotate vertically

# show result


In [None]:
# Plot and add labels


#label axes



# title

# place the legend


# Redefine the ticks and rotate 45 degrees



## Plotting more data
- Let's see how the quantity produced changes with prices
- We will create a new dataframe: india_prices
- It will record the average price each year using agg() again

In [None]:
# create india_prices, grouped by year with average price

# reset index

#show dataframe


## merge()
- Let's use india and india_prices together
- To do so we will use the merge() method
- And save it in india_full
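
A sketch of both steps on made-up data, assuming price columns named `price_rice` and `price_wheat`. The two yearly dataframes share the `year` column, which serves as the merge key.

```python
import pandas as pd

# Made-up district-year data (two districts per year)
df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "rice": [1.0, 2.0, 1.5, 2.5],
    "wheat": [3.0, 4.0, 3.5, 4.5],
    "price_rice": [10.0, 12.0, 11.0, 13.0],
    "price_wheat": [8.0, 9.0, 8.5, 9.5],
})

# Yearly totals of yields
india = df.groupby("year").agg({"rice": "sum", "wheat": "sum"}).reset_index()

# Yearly average prices
india_prices = (
    df.groupby("year")
    .agg({"price_rice": "mean", "price_wheat": "mean"})
    .reset_index()
)

# Combine the two on their common key
india_full = india.merge(india_prices, on="year")
print(india_full)
```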

In [None]:
# merge india and india_prices and assign to india_full



In [None]:
# initialize a 2x2 plot
fig, axes = plt.subplots(2,2)

plt.show()

In [None]:
# plot evolution of yield and prices for rice and wheat


# Scatter plot


In [None]:
# Scatter plot price (x-axis) against yield (y-axis) for rice and wheat separately 


# Distribution of a single variable

- Often you will want to get a sense of the distribution of a single variable
- If you are interested in Bayesian statistics this is a must
- Let's look at the distribution of rice and wheat in our dataset
- A distribution gives you prob$(x = X)$

In [None]:
# Plot distribution of rice and wheat side by side


This is not a density function (look at the y-axis).
Let's set the argument density to True.
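
A sketch on simulated data (random draws, not the real yields). With `density=True` the bar heights are rescaled so that the total area of the bars equals one, which the last line checks for rice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated stand-in for the rice and wheat columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rice": rng.gamma(2.0, 1.0, 500),
    "wheat": rng.gamma(3.0, 1.0, 500),
})

# Two histograms side by side, 30 bins each, on a density scale
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
rice_heights, rice_edges, _ = axes[0].hist(df["rice"], bins=30, density=True)
wheat_heights, wheat_edges, _ = axes[1].hist(df["wheat"], bins=30, density=True)
axes[0].set_xlabel("rice")
axes[1].set_xlabel("wheat")

# Area under the rice histogram: height times bin width, summed over bins
rice_area = (rice_heights * np.diff(rice_edges)).sum()
print(rice_area)
```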

In [None]:
# Plot distribution of rice and wheat side by side with density set to True and 30 bins


## Cumulative distribution
- Cumulative distribution gives you prob$(x\leq X)$
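
A sketch on simulated data (random draws, not the real yields). Setting `cumulative=True` in `hist()` makes each bar include everything to its left, so the bars never decrease and the last one reaches one when `density=True`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated stand-in for the rice column
rng = np.random.default_rng(0)
df = pd.DataFrame({"rice": rng.gamma(2.0, 1.0, 500)})

fig, ax = plt.subplots()
# Each bar shows the share of observations at or below its right edge
heights, edges, _ = ax.hist(df["rice"], bins=30, density=True, cumulative=True)
ax.set_xlabel("rice")
ax.set_ylabel("cumulative share")
```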

In [None]:
# Plot show cumulative distributions


## Making sense of distributions
- So these distributions give us a sense of how spread out the data is
- In our original data there are two sources of variation:
    - Cross-sectional: some districts are more productive than others in any given year
    - Intertemporal: some years are more productive than others on average
    
Let's now look at rice in different years for the 2000s

In [None]:
# Subset india to the 2000 decade

# store the unique values for years and the number of years in the dataset


- We can now plot for each year separately
- The variation will thus capture differences across districts within each year

In [None]:
# plot each year separately using a for loop


Let's now look at how the distribution varies by State
- We will first need to aggregate by state and year
- We will then plot a distribution for each State
- This will capture the variation across years within each State separately

In [None]:
# Create new dataframe india_state grouped by year and state with the sum of yield rice and the first value of statename


# store the unique values for States and the number of States in the dataset


## In class exercise
Plot the distribution of rice for each state separately
- use a for loop in which you enumerate over the unique values of states
- label the x-axis with the name of the State in all caps
- make sure that the x-axis is shared by all plots
- figsize should be the tuple (12,30)
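
One possible solution sketch, using a made-up `india_state` dataframe with two states (names and values are invented). With more states the loop stays exactly the same.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up state-year totals
india_state = pd.DataFrame({
    "statename": ["bihar"] * 3 + ["punjab"] * 3,
    "year": [2000, 2001, 2002] * 2,
    "rice": [3.0, 3.5, 4.0, 6.0, 6.5, 7.5],
})

# Unique states and how many there are
states = india_state["statename"].unique()
n_states = len(states)

# One row of plots per state, all sharing the same x-axis
fig, axes = plt.subplots(n_states, 1, sharex=True, figsize=(12, 30))

for i, state in enumerate(states):
    # Keep only the rows of the current state
    subset = india_state[india_state["statename"] == state]
    axes[i].hist(subset["rice"])
    # Label the x-axis with the state name in all caps
    axes[i].set_xlabel(state.upper())
```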

In [None]:
# set up fig, axes and pass the relevant arguments

#loop over each state


## Nicer plot
- For better looking plots you can use Seaborn
- Seaborn is built on Matplotlib which means you can do with it anything Matplotlib can do and more
- Let's plot the kernel density of rice for the entire dataset
- Kernel density is a way to plot your data in a continuous-looking way

In [None]:
# plot kernel density of rice, set color to black


In [None]:
# plot the kdensity for all years and set the color to black

# add a kdensity for each year
