# DCFemTech Tour de Code: Data Science & Machine Learning Tutorial

### Tutorial Overview
This tutorial is divided into 2 parts. During the first half of the tutorial, we introduce data cleaning and exploratory data analysis. The second half will cover more advanced topics such as Machine Learning.

At any point during the tutorial if you get stuck or need further clarification, please do not hesitate to raise your hand so that a facilitator can help you.

### Dataset
The dataset `census.csv`  was derived from the Census Bureau Database and is a comprehensive record of over 48,000 individuals and their socio-economic information.

### The Challenge
Determine a person's income level based on the socio-economic measures given.


## Part 1: Data Cleaning & Exploratory Data Analysis


## Getting Started

<font color=blue> Please install the necessary Python packages before starting this notebook. The packages used are:
- pandas (as pd)
- numpy (as np)
- datetime (as dt)
- plotly (plotly.plotly as py / plotly.graph_objs as go)
- matplotlib.pyplot (as plt)
- seaborn (as sns)
    
</font>

#### STEP 1: IMPORT THE PACKAGES
Import the packages we'll be using during the exercise. This has already been done for you. Just select the cell and click "Run" (or press Shift+Return) to execute the imports.

Remember, if you're using Anaconda, you'll need to create a new kernel (follow the instructions in the participant's guide).

In [None]:
# These are the packages you need. Simply run this cell.
from IPython.display import display

import pandas as pd
import numpy as np
import datetime as dt

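import plotly
# Note: in plotly 4 and later, plotly.plotly moved to the separate chart_studio package.
# If the next line fails, try `import chart_studio.plotly as py` instead.
import plotly.plotly as py
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
from plotly.offline import *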

import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_rows', 999) # default is 60 rows

#Always set a random seed to replicate your results
np.random.seed(44)

Now that you have your packages imported, we will load our census data from our csv (comma-separated values) file into a **Pandas DataFrame**. A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. Each row has a label (the row index), and each column has a label (the column name). Simply stated, a dataframe is a table of data, similar to a spreadsheet or SQL table.

A csv file is a text file containing data in table form, where columns are separated using the ‘,’ comma character, and rows are on separate lines. Loading a csv file is made extremely simple with the `.read_csv()` function in Pandas, once you know the path to your file. 

In [None]:
#Run this line to "read-in" your dataset
census = pd.read_csv("datasets/census.csv")

Once you've read in the data, use the `.head()` method, which prints the first 5 rows (by default). You can use this method to inspect just the beginning of the data.

In [None]:
#Run this to see the first 5 rows of your dataset
census.head()

In [None]:
census.head(30) # Prints the first 30 rows

Want to see the last rows of your dataset? Using the `.tail()` method, inspect the end of the data.

In [None]:
#use .tail() method to see the bottom rows
census.tail()

Use `.columns` to get a list of all the column names. 

In [None]:
census.columns
#where did the () go? 

Looks like some columns have extra white space at the beginning or the end (race, income). Let's quickly remove this leading and trailing whitespace.


In [None]:
#Remove leading or trailing whitespace from columns:
census.columns = census.columns.str.strip()

In [None]:
#Dashes in column names break dot notation (census.native-country would be read as a subtraction). We strongly recommend replacing dashes with underscores
census.columns = census.columns.str.replace('-', '_')
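
To see why this cleanup matters, here is a minimal sketch on a made-up two-row table. The messy column names below are invented for illustration and are not taken from `census.csv`.

```python
import pandas as pd

# Toy data with messy column names (hypothetical, for illustration only)
toy = pd.DataFrame({" hours-per-week": [40, 35], "age ": [39, 50]})

toy.columns = toy.columns.str.strip()            # remove leading/trailing spaces
toy.columns = toy.columns.str.replace("-", "_")  # dashes -> underscores

cleaned_names = list(toy.columns)
print(cleaned_names)

# With underscores in place, dot notation works on the column
total_hours = toy.hours_per_week.sum()
print(total_hours)
```

Before the cleanup, `toy.hours-per-week` would not even parse as a column lookup, which is why the tutorial standardizes on underscores.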

The `.shape` attribute gives information on the size of the dataset. The first number is the number of rows and the second is the number of columns.

#### Q. How many rows and columns are in the data set? 

In [None]:
#input number of rows and columns here

Many dataframes have mixed data types, that is, some columns are integers, some are strings, and some are dates etc. Internally, csv files do not contain information on what data types are contained in each column. Pandas infers the data types when loading the data.

Use the `.dtypes` attribute to check the type of each column.

*Note: strings are loaded as ‘object’ datatypes.*
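
As a small illustration of `.shape` and `.dtypes`, the sketch below reads a tiny made-up CSV from memory rather than the census file. The column names and values are assumptions chosen for the example. Both are attributes rather than methods, so they take no parentheses.

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory (not the census file)
csv_text = "age,sex,hours_per_week\n39,Male,40\n28,Female,37.5\n"
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape   # attribute, so no parentheses
print(n_rows, n_cols)

# One inferred type per column: whole numbers become integers,
# decimals become floats, and text gets a string-like type
print(toy.dtypes)
```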

In [None]:
#Use the .dtypes method here

Use `.describe()` and `.info()` on the dataframe, and `.unique()` on a single column (for example `census['education'].unique()`), to do more basic exploration before we begin cleaning the data.

In [None]:
# enter .describe(), .unique() and .info() here:

In [None]:
census.describe()

## Selecting Data

We can select a column of data using the name of the column as shown below:

In [None]:
#How to choose a column/series:
census['age']

In [None]:
#In pandas, you can also use a . to select a column/series
census.age

In [None]:
census[['age']] # Double brackets return a one-column DataFrame (index plus the named column)

In [None]:
census[['age','sex', 'education']] # Selects multiple columns
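
The difference between single and double brackets is easiest to see by checking the type of the result. This is a minimal sketch on a made-up three-row table, not the census data.

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50, 28], "sex": ["Male", "Male", "Female"]})

single = toy["age"]      # single brackets -> pandas Series
double = toy[["age"]]    # double brackets -> one-column DataFrame

print(type(single).__name__, type(double).__name__)
print(single.shape, double.shape)
```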

Now try it yourself. 
#### Q. Print the occupation column.

In [None]:
#Print occupation column:

#### Q. Print the marital status, relationship, and age columns below:

In [None]:
#Marital status column/series:

In [None]:
#Relationship column/series:

In [None]:
#Age column/series:

## Cleaning the data
Perfect data sets are rare. There are advanced techniques for cleaning messy data sets. The primary goal should be to tidy the data. The two principles of tidy data are as follows:
* Each column represents a variable. Each row represents an observation.
* Similar data grouped together is a dataset.

### Missing Value(s)
Are there any missing data? Any outliers? Shown here we see that the oldest person is **90** years old. Does that make sense? What other areas can you check to see where data is missing or not statistically representative? 

Cleanse the data by removing outlying observations with spurious or erroneous data.

In [None]:
#Data cleaning here:

To find the number of nulls per column you can chain the `.isnull()` and `.sum()` methods:

In [None]:
null_sum = census.isnull().sum() #sum of all the nulls
null_sum

You can also visualize this by plotting the number of nulls per column:

In [None]:
#How to make a bar graph. And how to plot null values

null_sum.plot(kind='bar') # Specifies a bar graph
plt.title('Number of null values per column') # Creates a title for the graph
plt.show() # Displays the graph

Now that we have identified outliers and missing data, let's try removing them. Pandas uses the `.drop()` function to remove rows and columns. 

Let's first try removing a column. To delete a column, use the name of the column, and specify the “axis” as 1 (rows are axis=0).

In [None]:
# Delete the column titled "native_country" and preview the first 5 rows.
census.drop("native_country", axis=1).head(5)

Now let's try removing a row. To delete a row, use the index labels, and specify the "axis" as 0.

In [None]:
# Delete the rows with label 2 and preview the first 5 rows.
census.drop([2], axis=0).head(5)

Notice how if you print the dataframe, the previously deleted column and row are still there! 

Take a minute and discuss this with the person sitting next to you as to why this could be.

In [None]:
census.head()
#census.shape  #uncomment census.shape to see how many rows and columns there are
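
Once you have discussed it, one way to check your answer is the sketch below. It uses a made-up three-row table (not the census data) to show that `.drop()` returns a new dataframe and leaves the original untouched unless you assign the result back.

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50, 28], "native_country": ["A", "B", "C"]})

dropped = toy.drop("native_country", axis=1)  # returns a new DataFrame
print(toy.shape)      # the original still has both columns
print(dropped.shape)  # the returned copy has one column fewer

toy = toy.drop("native_country", axis=1)      # reassign to keep the change
print(list(toy.columns))
```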

If we want to remove rows which contain missing values, we can use the `.dropna()` function. We can also specify whether we want to drop a row if it contains all NA or at least one NA.

* ‘any’ : If any NA values are present, drop that row or column.
* ‘all’ : If all values are NA, drop that row or column.
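
The difference between the two options can be sketched on a small made-up table (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [39, np.nan, np.nan],
    "occupation": ["Sales", "Tech-support", None],
})

rows_any = toy.dropna(axis=0, how="any")  # drops rows 1 and 2 (each has at least one NA)
rows_all = toy.dropna(axis=0, how="all")  # drops only row 2 (every value is NA)

print(len(rows_any), len(rows_all))
```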

In [None]:
#Remove missing values. Then look at the shape to see what was dropped
census = census.dropna(axis=0, how="any")
census.shape

## Exploratory Data Analysis (EDA)

Now that we have prepped the data by cleaning it, let's dive deeper into the data to see what we can find. 

### Summary Statistics

We can use built-in functions to find various types of summary statistics:

In [None]:
census.age.min() # Find the minimum age

In [None]:
census.age.max() # Find the maximum age

In [None]:
census.age.median() # Find the median age

In [None]:
census.age.mean() # Find the average age

### Filtering Data

We can use simple operator comparisons on columns to extract relevant or drop irrelevant information. If we want to filter the data to see only the individuals who are less than 20 years of age we can do the following:

In [None]:
#Find people less than 20 years old 
age_filter = census.age < 20 
census[age_filter] # Will filter our dataframe by the condition we specified

Filter the dataframe to see only the individuals who work more than 40 hours per week.

In [None]:
hours_worked_filter = census["hours_per_week"] > 40 #folks who work more than 40 hrs per week (column renamed with underscores earlier)
census[hours_worked_filter]

Let's try filtering for only females:

In [None]:
sex_filter = census.sex == "Female" #in Python == means equal. Know the difference between = and ==
census[sex_filter]

In [None]:
census[census.sex == "Female"] # Same as above cell, condensed onto one line

We can combine multiple conditions as well by using the **&** operator:

In [None]:
census[(census.age < 20) & (census.sex == "Female")] #folks who are under age 20 and female
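
Conditions can also be combined with `|` (or) and negated with `~` (not). The sketch below uses a made-up four-row table, not the census data, to show all three.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [17, 19, 45, 62],
    "sex": ["Female", "Male", "Female", "Male"],
})

# Each condition needs its own parentheses because & and | bind
# more tightly than comparisons such as < and ==.
young_and_female = toy[(toy.age < 20) & (toy.sex == "Female")]
young_or_female = toy[(toy.age < 20) | (toy.sex == "Female")]
not_young = toy[~(toy.age < 20)]

print(len(young_and_female), len(young_or_female), len(not_young))
```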

Try it yourself. Work with the person next to you to create your own filter. Be prepared to share your filter.

In [None]:
#Create your own filter here:

### More Exploratory Data Analysis (EDA)

Often while working with a Pandas dataframe you will have a column of categorical variables (strings/characters) and want to find the frequency count of each unique element in the column. Pandas' `.value_counts()` method makes it easy to get these frequency counts.

In [None]:
#Find value counts here
census['education'].value_counts()

#### Q. How many folks have a 9th grade education?

The `.groupby()` method is a very powerful Pandas method. 
You can group by one column and count the values of another column per this column value using the `.value_counts()` method. 

In [None]:
# Provides a value count for marital_status by country (column names use underscores after the earlier rename)
census.groupby('native_country')['marital_status'].value_counts() 

You can use the groupby function in combination with statistical functions as well:

In [None]:
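# For each gender, calculate the mean of the numeric columns in the dataframe.
# numeric_only=True skips text columns, which newer pandas versions refuse to average.
census.groupby('sex').mean(numeric_only=True)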
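
Both groupby patterns can be tried on a small made-up table where the answers are easy to verify by eye. The rows below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "marital_status": ["Married", "Single", "Single", "Single"],
    "age": [30, 40, 50, 60],
})

# Count marital_status values within each sex
counts = toy.groupby("sex")["marital_status"].value_counts()
print(counts)

# Mean of one numeric column per group
mean_age = toy.groupby("sex")["age"].mean()
print(mean_age)
```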

Try grouping by age and find the mean of numeric columns in the dataframe. Do you notice any trends?

In [None]:
#Use groupby the age and find mean of numeric columns in dataframe.

### NOW IT'S YOUR TURN

Now it's your turn. Use the cheat sheets, your facilitators, and help from the other teams to answer some of the following questions, or create your own questions.

* What is the min, max, mean, and median age?
* How many individuals per country?

In [None]:
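# Find the value counts of the native country.
# If you get an error here (for example a KeyError), check that you ran the earlier cell that replaces dashes with underscores in the column names.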
census['native_country'].value_counts()

## Your Turn:

Now that you've walked through the basics of data cleaning and exploratory data analysis, spend time exploring the data and doing your own analysis. Consider what questions you'd like answered and begin answering them.

## Part 2: Machine Learning and Predictive Analysis

What kind of predictive analysis can we do on this data? Which countries have incomes below or above $50k? 

Machine learning is a technique that uses the past to guess at what the future will hold. We can apply it here in a variety of ways. There are two main types of machine learning: supervised (where the target is labeled) and unsupervised (where the target is not labeled). We'll work primarily with supervised machine learning for this data set. Beyond that, there are several main tasks for the machines: categorizing data based on its features or predicting future values.

![Python's ML package, Scikit-learn, has its own cheat sheet](http://scikit-learn.org/stable/_static/ml_map.png)
http://scikit-learn.org/stable/_static/ml_map.png

You may have seen while doing your exploratory analysis that some trends are clearer than others. There's always a fuzzy area, and machine learning works to handle that ambiguity in ways that humans are not so great at. Of course, there are plenty of legal and ethical questions surrounding the use of machine learning, as input biases lead to output biases and creating fairness in artificial intelligence is a hot topic, but for now we'll use machines to see if we can draw any new insights from our data.

Here we'll study two different classifiers and how to measure their accuracy. The first is Logistic Regression and the second is a Random Forest Classifier. They work in different ways, which will be explained further.

In [None]:
#import the splitter and the Machine Learning (ML) algorithms we will use.
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn import metrics
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn import tree
# Older scikit-learn versions exposed these under sklearn.cross_validation, which has since been removed:
#from sklearn.cross_validation import KFold, cross_val_score
#from sklearn.cross_validation import train_test_split
%matplotlib inline

## Viewing the Data

Before going into machine learning, we have to do some exploratory data analysis to see which features in the data make a difference to the outcome. Where there are poor correlations, we're likely to see poor performance, whereas stronger correlations mean better performance. Below, we'll view the data two different ways -- using a pair plot and a correlation heatmap.

In [None]:
census.columns

In [None]:
#Let's see the relationship of our target variable (income) to each other variable:
sns.set(style="ticks")
sns.pairplot(census, hue="income");

In [None]:
#Let's build a correlation matrix (.corr() method). 
#We need to know which features are/aren't correlated with each other. 
#The goal is to feed uncorrelated features into the model.

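# Correlations are only defined for numeric columns, so select those first
corr = census.select_dtypes(include='number').corr()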
sns.heatmap(corr, annot = True); #in seaborn, put a ; at the end to remove that funny line
#annot = True means to place the values in each square.

## Preparing the Data
Some models only work with numeric data. That means we must convert all the words (categorical data) to numbers in order to feed it into our models. 

Let's get to it. 

In [None]:
#Run this line to "read-in" your dataset
census = pd.read_csv("datasets/census.csv")

#Remove leading or trailing whitespace from columns:
census.columns = census.columns.str.strip()

#Our column names won't work. We strongly recommend removing dashes and only using underscores
census.columns = census.columns.str.replace('-', '_')

In [None]:
#First take a look at your column names. You have 2 choices: rename them to remove the dashes or only use [] notation for columns
census.columns

In [None]:
census.head()

In [None]:
# General pattern (data_frame is a placeholder name, not a variable defined in this notebook):
# data_frame = pd.get_dummies(data_frame, dummy_na=True, drop_first=True)

In [None]:
# Let's encode our columns. Let's make a dataframe for the categorical (text) columns
dummy_columns = census[['work_class', 'income', 'marital_status', 'race', 'sex', 'relationship', 'native_country', 'education', 'occupation']]

In [None]:
dummy_columns.head()

In [None]:
census.columns

In [None]:
#Let's make a dataframe of our numeric columns. We'll merge this onto the dummy'd columns

In [None]:
numeric_columns = census[['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']]

In [None]:
#Let's make magic in one line. 
#Here we concatenate the numeric columns to the dummied column, and drop the first one.
df = pd.concat([numeric_columns, pd.get_dummies(dummy_columns, dummy_na=True, drop_first=True)], axis = 1)
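
The same one-line pattern can be sketched on a made-up three-row table to see exactly which columns come out. The data here is invented for illustration. Note how `dummy_na=True` adds an extra `_nan` column that flags missing values.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Female", "Female"],
})

# drop_first=True drops the first category ("Female" here, alphabetically),
# since it is implied whenever the remaining dummy columns are all 0.
dummies = pd.get_dummies(toy[["sex"]], drop_first=True)

# Join the numeric column and the encoded column side by side
encoded = pd.concat([toy[["age"]], dummies], axis=1)
print(list(encoded.columns))

# dummy_na=True adds a sex_nan column that flags missing values
with_na = pd.get_dummies(toy[["sex"]], dummy_na=True, drop_first=True)
print(list(with_na.columns))
```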
                

In [None]:
df.head()

In [None]:
df.columns

In [None]:
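# These column names come from a differently encoded version of the data and do not exist in df as built above,
# so this cell is left commented out. X and y are defined in the cells below.
#X = df[['workclass_num', 'education_num', 'marital_num', 'race_num', 'sex_num', 'rel_num', 'capital.gain', 'capital.loss']]
#y = df.over50K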

In [None]:
# To feed our data into our model, we must define our X and y
#X = our features - the useful variables we will use in our model
#y = the target variable - the thing we are trying to predict

In [None]:
#First, define your target variable.
y = df['income_ >50K']

In [None]:
#Next, define all of the features you'll use in your model. Always drop the y variable.
X = df.drop(['income_ >50K', 'income_nan'], axis=1)

## Logistic Regression

This is one of the most commonly used models. Logistic regression allows us to predict classes. In this case, we are predicting whether or not income is greater than $50k.

In [None]:
#First we must split our data into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
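
To make the split concrete, here is a minimal sketch on ten made-up rows (not the census data). With `test_size=0.3`, three rows go to the test set and seven stay in the training set. Fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)
print(X_tr.shape, X_te.shape)
```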

In [None]:
lr = LogisticRegression()

In [None]:
#Now let's fit our model
lr.fit(X_train, y_train)

### Let's get a score! 

In [None]:
#Now let's score our model. Congratulations, you've built a model. This here is machine learning
print("Training set accuracy: {:.3f}".format(lr.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(lr.score(X_test, y_test)))

### Let's make predictions!

In [None]:
# One only predicts on the test set!
y_pred_class = lr.predict(X_test)

#Now let's put it into a dataframe so we can read the predictions next to the actual values
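
One possible way to line up predictions against actual values is sketched below. It runs on synthetic stand-in data (a single made-up feature, not the census data), and the names `results`, `X_toy`, and `y_toy` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature, and a target that is 1
# whenever the feature is positive.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=200)})
y_toy = (X_toy["feature"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_tr, y_tr)

# Put actual and predicted classes side by side, keeping the test-set index
results = pd.DataFrame(
    {"actual": y_te, "predicted": model.predict(X_te)}, index=y_te.index
)
results["correct"] = results["actual"] == results["predicted"]

print(results.head())
print(results["correct"].mean())  # fraction correct, the same quantity model.score reports
```

In the notebook itself, the same idea would apply to `y_test` and `y_pred_class`.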


## Which features are the most important in the model?

In [None]:
#Plot Logistic regression coefficients

plt.figure(figsize=(20,8))
plt.plot(lr.coef_.T, 'o', label="C=1")
#plt.plot(logreg100.coef_.T, '^', label="C=100")
#plt.plot(logreg001.coef_.T, 'v', label="C=0.001")
plt.xticks(range(X.shape[1]), X.columns, rotation=90) # label each tick with its feature name
plt.hlines(0, 0, X.shape[1])
#plt.ylim(-5, 5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.savefig('log_coef')

## Random Forests

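A decision tree is easy to understand: it is a chain of if/then rules, much like the conditional statements found in most programming languages. The problem with such hard-edged rules is overfitting. A single tree can keep splitting until it fits every training example perfectly, outliers included, so it does not generalize well to new data.

A random forest is a collection of decision trees, each trained on a random sample of the data, whose predictions are combined (for example, by majority vote).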

In [None]:
rf = RandomForestClassifier()

#Now let's fit our model
rf.fit(X_train, y_train)

In [None]:
##Now let's score our random forest model. Looks like this model is ... overfit (dun dun dun)
print("Training set accuracy: {:.3f}".format(rf.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(rf.score(X_test, y_test)))
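
To see the overfitting pattern in a controlled setting, the sketch below compares a single decision tree with a random forest on synthetic data from scikit-learn's `make_classification`. The data is a stand-in chosen for illustration, not the census data, and the exact scores will depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (a stand-in, not the census data)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# A large gap between training and test accuracy is a sign of overfitting
for name, model in [("tree", single_tree), ("forest", forest)]:
    print(name,
          "train:", round(model.score(X_tr, y_tr), 3),
          "test:", round(model.score(X_te, y_te), 3))
```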

In [None]:
#Which features contributed most to the model's decisions? (values are in the same order as X.columns)
rf.feature_importances_
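
The raw array is hard to read on its own. One option, sketched here on synthetic data with made-up feature names, is to pair each importance with its column name and sort.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up feature names (not the census data)
X_arr, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the equivalent would use `rf.feature_importances_` with `X.columns` as the index.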

### Your Turn

What can you get from the data here?

### Your Turn:

Look into various implementations of tree models and give them a try. Do any improve the quality of the predictions? Also consider removing features or using different feature combinations to optimize your results. Can you get above an 80% accuracy? What happens when you change the test/train split? There are many parameters that can be changed when performing machine learning. Which ones give you the best results?
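
A single train/test split can be lucky or unlucky. One way to explore this, sketched below on synthetic data (not the census data), is k-fold cross-validation with the `KFold` and `cross_val_score` helpers imported earlier. The fold count and model settings here are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Split the data into 5 folds; each fold takes one turn as the test set
folds = KFold(n_splits=5, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=folds)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```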

## Write Up

What did you find during this exercise? What insights would you like to share with the world from doing this exploration? Here is your chance to make conclusions and present your findings. When you're finished, you may submit your kernel/notebook to Kaggle.