# Lab 03 - Data Modeling and Visualisation

As we discussed this week and last week, we build models to describe, explain, and predict things in the social world. We use visualisation to explore data, check our models, and present results to broader audiences.

To do this, we'll work with the data from [More Tweets, More Votes](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0079449) (the link goes to the original paper, if you're curious), and the Iris data we have been using in lab.

First we'll load the More Tweets, More Votes (MTMV) data. The argument <code>index_col</code> tells pandas to use one of the existing columns in the file (here the first one, <code>0</code>) as the index column.

In [None]:
import pandas as pd
import numpy as np
df_mtmv = pd.read_csv("data/mtmv_data_10_12.csv", index_col = 0)

## drop rows with missing values in vote_share, mshare, or rep_inc
df_mtmv = df_mtmv.dropna(subset = ['vote_share', 'mshare', 'rep_inc'])
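
To see what <code>index_col</code> and <code>dropna(subset = ...)</code> do, here is a minimal sketch on a made-up four-row CSV. The district labels and values are hypothetical and are not taken from the MTMV file.

```python
import io
import pandas as pd

# Hypothetical toy CSV (not the MTMV data): the first column holds row labels.
csv_text = """district,vote_share,mshare,rep_inc
A,55.0,60.0,1
B,,40.0,0
C,48.0,52.0,
D,51.0,49.0,0
"""

# index_col = 0 uses the first column ("district") as the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col = 0)

# dropna(subset = ...) drops any row with a missing value in a listed column
toy_clean = toy.dropna(subset = ['vote_share', 'mshare', 'rep_inc'])

print(toy.shape)                 # rows and columns before dropping
print(toy_clean.index.tolist())  # labels of the rows that survive
```

Rows B and C each have one missing value, so only A and D remain after the drop.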

In the last lab we described the data frame by looking at the structure of the data and taking simple measures of central tendency and dispersion. Now we can break those measures down by the values of another variable.

To do this, we use a new method called <code>groupby</code> which allows us to group by a variable or set of variables and apply some operation across them.

In [None]:
## vote share and mention share mean 
## by Republican incumbency
gr_mtmv = df_mtmv.groupby('rep_inc')
gr_mtmv[['vote_share', 'mshare']].mean()

In [None]:
## vote share and mention share standard deviation 
## by Republican incumbency
gr_mtmv[['vote_share', 'mshare']].std()
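
The two cells above can also be combined into one call. As a sketch on hypothetical toy data (not the MTMV data), <code>agg</code> applies several summary functions to each group at once:

```python
import pandas as pd

# Hypothetical toy data (not the MTMV data)
toy = pd.DataFrame({
    'rep_inc': [0, 0, 1, 1],
    'vote_share': [40.0, 50.0, 60.0, 70.0],
    'mshare': [30.0, 50.0, 55.0, 65.0],
})

# agg applies each listed function to each selected column within each group
summary = toy.groupby('rep_inc')[['vote_share', 'mshare']].agg(['mean', 'std'])
print(summary)
```

The result has one row per value of <code>rep_inc</code> and one column per variable and statistic pair.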

Now we can use a metric like [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) to summarise the association between two variables.

In [None]:
## pearsonr returns the correlation coefficient first, then the p-value
from scipy.stats import pearsonr

print(pearsonr(df_mtmv['mshare'], df_mtmv['vote_share'])[0])
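
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand on a few hypothetical values (not the MTMV data) and compares it with <code>pearsonr</code>. It is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical toy values (not the MTMV data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# deviations of each value from its variable's mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: co-variation of x and y scaled by the variation in each
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)
```

Both numbers should agree up to floating point error.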

Correlation only relates two variables at a time. To model vote share as a function of several variables at once, we can fit an ordinary least squares (OLS) regression with <code>statsmodels</code>.

In [None]:
from statsmodels.formula.api import ols

model = ols("vote_share ~ rep_inc + mshare + pct_white + \
            pct_college + med_hhinc + pct_female", df_mtmv).fit()
model.summary()
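
As a rough illustration of what an OLS fit estimates, the sketch below solves a least squares problem directly with NumPy's <code>np.linalg.lstsq</code> rather than with <code>statsmodels</code>. The data are synthetic, and the variable names and the "true" coefficients (10, 2, 5) are assumptions made up for the example.

```python
import numpy as np

# Synthetic data (not the MTMV data); the true coefficients are made up
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size = n)            # continuous predictor
x2 = rng.integers(0, 2, size = n)    # binary predictor, like an incumbency dummy
y = 10.0 + 2.0 * x1 + 5.0 * x2 + rng.normal(scale = 1.0, size = n)

# design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# least squares picks the coefficients that minimise the sum of squared residuals
coefs, *_ = np.linalg.lstsq(X, y, rcond = None)

# R-squared: share of the variance in y accounted for by the fitted values
residuals = y - X @ coefs
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(coefs, r_squared)
```

The estimated coefficients should land close to the values used to generate the data. On the fitted <code>statsmodels</code> model above, the corresponding quantities are available as <code>model.params</code> and <code>model.rsquared</code>.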

### Exercise 1

1. Load the iris dataset (it will be in the same place as the previous lab, 'data/iris.csv'). 
2. Take the mean and standard deviation of the sepal length and sepal width, grouped by species.
3. Take the mean and standard deviation of petal length and petal width by species.
4. Calculate the correlation between sepal length and petal length.

## Visualisation

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

In [None]:
df_mtmv['mshare'].hist()

In [None]:
df_mtmv['mshare'].plot.kde()

In [None]:
df_mtmv[['mshare', 'vote_share']].hist()

In [None]:
df_mtmv.plot.scatter(x = 'mshare', y = 'vote_share')

In [None]:
ax = df_mtmv[df_mtmv['rep_inc'] == 1].plot.scatter(x = 'mshare', y = 'vote_share', color = 'Red', label = 'Rep')
df_mtmv[df_mtmv['rep_inc'] == 0].plot.scatter(x = 'mshare', y = 'vote_share', color = 'Blue', label = 'Dem', ax = ax)
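
With more than two groups, repeating the filter for each one gets tedious. One possible alternative, sketched here on hypothetical toy data (not the MTMV data), is to loop over the groups and draw each onto the same axes. The <code>colors</code> and <code>labels</code> dictionaries are names introduced just for this example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data (not the MTMV data)
toy = pd.DataFrame({
    'rep_inc': [0, 0, 1, 1],
    'mshare': [30.0, 50.0, 55.0, 65.0],
    'vote_share': [40.0, 50.0, 60.0, 70.0],
})

# map each group value to a colour and a legend label
colors = {1: 'red', 0: 'blue'}
labels = {1: 'Rep', 0: 'Dem'}

fig, ax = plt.subplots()
for value, group in toy.groupby('rep_inc'):
    # each call adds one group's points to the shared axes
    group.plot.scatter(x = 'mshare', y = 'vote_share',
                       color = colors[value], label = labels[value], ax = ax)
```

The same pattern extends to three or more groups by adding entries to the two dictionaries.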

### Exercise 2

1. Plot the histograms of both sepal length and sepal width.
2. Plot the density plots of both petal length and petal width.
3. Plot a scatter plot of sepal length and sepal width.
4. Plot a scatter plot of sepal length and sepal width, where virginica is Green, setosa is Red, and versicolor is Blue.
5. Do the same as 4 for petal length and width.