# DCFemTech Tour de Code: Data Science & Machine Learning Tutorial

### Tutorial Overview
This tutorial is divided into 2 parts. During the first half of the tutorial, we introduce Pandas and exploratory data analysis. The second half will cover more advanced topics such as Machine Learning.

At any point during the tutorial if you get stuck or need further clarification, please do not hesitate to raise your hand so that a facilitator can help you.

### Dataset
The dataset `census.csv` was derived from the Census Bureau Database and is a comprehensive record of over 48,000 individuals and their socio-economic information.

### The Challenge
Determine a person's income level based on the socio-economic measures given.


## Part 1: Data Cleaning & Exploratory Data Analysis


## Getting Started

Import the packages we'll be using during the exercise. This has already been done for you. Just select the cell and click "Run" (or Shift+Return on Macbooks) to execute the imports. 

In [None]:
# These are the packages you need. Simply run this cell.
from IPython.display import display

import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns
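# Use the full option names; the short forms can match more than one option in newer Pandas versions.
pd.set_option('display.max_rows', 999) # default is 60 rows
pd.set_option('display.max_columns', 999) # default is 20 columns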

# Always set a random seed to replicate your results
np.random.seed(42)

Now that you have your packages imported, we will load our census data from our csv (comma-separated values) file into a **Pandas DataFrame**. A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. Each row has a unique label (the row index), and each column has a unique label (the column index). Simply stated, a dataframe is a table of data, similar to a spreadsheet or SQL table.

A csv file is a text file containing data in table form, where columns are separated using the ‘,’ comma character, and rows are on separate lines. Loading a csv file is made extremely simple with the `.read_csv()` function in Pandas, once you know the path to your file. 

In [None]:
# Run this line to "read-in" your dataset
census = pd.read_csv("datasets/census.csv")
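
To see what `read_csv` does without needing the file on disk, here is a minimal sketch that reads a tiny, made-up CSV from an in-memory string. The column names and values are invented for illustration and are not taken from `census.csv`.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line is the header, each later line is one row.
csv_text = "age,education,hours-per-week\n39,Bachelors,40\n50,HS-grad,13\n38,Masters,45\n"

# read_csv accepts a file path or any file-like object, such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
```

Each comma-separated value lands in its own column, and each line of text becomes one row of the dataframe.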

Once you've read in the data, use the `.head()` method, which prints the first 5 rows (by default). You can use this method to inspect just the beginning of the data.

In [None]:
# Run this to see the first 5 rows of your dataset
census.head()

In [None]:
census.head(20) # Prints the first 20 rows

Want to see the last rows of your dataset? Using the `.tail()` method, inspect the end of the data.

In [None]:
# use .tail() method to see the bottom rows
census.tail()

Use `.columns` to get a list of all the column names. 

In [None]:
census.columns
# where did the () go? 

Looks like some columns have extra white space at the beginning or the end (race, income), and some contain dashes. Let's quickly clean these up.


In [None]:
# Dashes in column names break dot notation (census.some-name is read as a subtraction), so replace them with underscores.
census.columns = census.columns.str.replace('-', '_')

In [None]:
# Remove leading or trailing whitespace from columns:
census.columns = census.columns.str.strip()

Let's see what our columns look like now:

In [None]:
census.columns
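
The two cleanup steps can also be chained. The sketch below uses a made-up two-column dataframe (the names and values are assumptions for illustration) to show that dot notation works once the name has no spaces or dashes.

```python
import pandas as pd

# Made-up column names with stray spaces and dashes, for illustration only.
toy = pd.DataFrame({" hours-per-week": [40, 13], "income ": ["<=50K", ">50K"]})

# Chain the two string methods: trim whitespace, then swap dashes for underscores.
toy.columns = toy.columns.str.strip().str.replace("-", "_")

print(list(toy.columns))
print(toy.hours_per_week.tolist())  # dot notation needs a valid Python identifier
```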

The `.shape` attribute gives information on the size of the dataset. The first number is the number of rows and the second is the number of columns. Because it is an attribute rather than a method, it is written without parentheses.

#### Q. How many rows and columns are in the data set? 

In [1]:
# input function here
# Rows, columns

Many dataframes have mixed data types, that is, some columns are integers, some are strings, and some are dates etc. Internally, csv files do not contain information on what data types are contained in each column. Pandas infers the data types when loading the data.

Use the `.dtypes` attribute to check the types of each column.

*Note: strings are loaded as ‘object’ datatypes.*

In [None]:
# Use the .dtypes attribute here
census.dtypes

Use `.describe()` and `.info()` on the dataframe, and `.unique()` on a single column (for example `census['education'].unique()`), to do more basic exploration before we begin cleaning the data.

In [None]:
# enter .describe(), .unique() and .info() here:

## Selecting Data

We can select a column of data using the name of the column as shown below:

In [None]:
# How to choose a column/series:
census['age']

In [None]:
# In Pandas, you can also use a . to select a column/series
census.age

In [None]:
census[['age']] # Double brackets return a one-column dataframe instead of a series

In [None]:
census[['age','sex', 'education']] # Selects multiple columns
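
The difference between single and double brackets is easy to miss, so here is a small sketch on a made-up dataframe (the values are invented for illustration). It checks what type of object each style of selection returns.

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

single = toy["age"]      # single brackets -> Series (one-dimensional)
double = toy[["age"]]    # double brackets -> DataFrame with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

A series has only a length, while a dataframe has both rows and columns, which is why the two shapes differ.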

Now let's practice with some exercises!
#### Q. Print the occupation column.

In [None]:
# Print occupation column:
census[["occupation"]]

#### Q. Print the marital status, relationship, and age columns below:

In [None]:
# Marital status column/series:

In [None]:
# Relationship column/series:

In [None]:
# Age column/series:

## Cleaning the data
It's exceedingly rare that a clean dataset will be available. Missing values are a common problem in real-life datasets. In areas like machine learning and data mining, poor-quality data caused by missing values can seriously hurt the accuracy of model predictions, so treating missing values is a major point of focus for making models more accurate and valid.

To find the number of nulls per column you can chain the `.isnull()` and `.sum()` methods:

In [None]:
census.isnull().sum() #sum of all the nulls

Optional: You can also visualize this by plotting the number of nulls per column:

In [None]:
# How to make a bar graph. And how to plot null values

null_sum = census.isnull().sum()

null_sum.plot(kind='bar') # Specifies a bar graph
plt.title('Number of null values per column') # Creates a title for the graph
plt.show() # Displays the graph

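If we ever need to remove a row or column to clean our data, we can use the `.drop()` function. Note that `.drop()` returns a new dataframe and leaves `census` unchanged unless you assign the result back.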

Let's first try removing a column. To delete a column, use the name of the column, and specify the “axis” as 1 (rows are `axis=0`)

In [None]:
# Delete the column titled "native_country" and preview the first 5 rows.
census.drop("native_country", axis=1).head(5)

Now let's try removing a row. To delete a row, use the index labels, and specify the "axis" as 0.

In [None]:
# Delete the row with label 2 and preview the first 5 rows.
census.drop([2], axis=0).head(5)

If we want to remove rows which contain missing values, we can use the `.dropna()` function. We can also specify whether we want to drop a row if it contains all NA or at least one NA.

* ‘any’ : If any NA values are present, drop that row or column.
* ‘all’ : If all values are NA, drop that row or column.

In [None]:
# Remove missing values. Then look at the shape to see what was dropped
census = census.dropna(axis=0, how="any")
census.shape
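
To make the difference between `how="any"` and `how="all"` concrete, here is a minimal sketch on a made-up four-row dataframe (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [39, np.nan, 38, np.nan],
    "occupation": ["Sales", "Tech-support", None, None],
})

drop_any = toy.dropna(axis=0, how="any")  # drop rows with at least one missing value
drop_all = toy.dropna(axis=0, how="all")  # drop only rows where every value is missing

print(len(toy), len(drop_any), len(drop_all))
```

Only the first row is complete, so `how="any"` keeps one row, while `how="all"` removes just the last row, where both values are missing.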

## Exploratory Data Analysis (EDA)

Now that we understand how to clean our data, let's dive deeper into the data to see what we can find. 

### Summary Statistics

We can use built-in functions to find various types of summary statistics:

In [None]:
census.age.min() # Find the minimum age

In [None]:
census.age.max() # Find the maximum age

In [None]:
census.age.median() # Find the median age

In [None]:
census.age.mean() # Find the average age

**Q. Find the minimum, maximum, median, and mean hours per week.**

In [None]:
# minimum hours per week here

In [None]:
# maximum hours per week here

In [None]:
# median hours per week here

In [None]:
# mean hours per week here

### Filtering Data

We can use simple operator comparisons on columns to extract relevant or drop irrelevant information. If we want to filter the data to see only the individuals who are less than 20 years of age we can do the following:

In [None]:
# Find people less than 20 years old 
age_filter = census.age < 20 
census[age_filter] # Will filter our dataframe by the condition we specified

Filter the dataframe to see only the individuals who work more than 40 hours per week.

In [None]:
hours_worked_filter = census["hours_per_week"] > 40 # folks who work more than 40 hrs per week
census[hours_worked_filter]

Let's try filtering for only females:

In [None]:
sex_filter = census.sex.str.strip() == 'Female' # in Python == means equal. Know the difference between = and ==
census[sex_filter]

In [None]:
census[census.sex.str.strip() == 'Female'] # Same as above cell, condensed onto one line

We can combine multiple conditions as well by using the **&** operator:

In [None]:
census[(census.age < 20) & (census.sex.str.strip() =='Female')] #folks who are under age 20 and female
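
Besides `&` (and), conditions can be combined with `|` (or) and inverted with `~` (not). The sketch below uses a made-up dataframe to show all three. When conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `<` and `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [18, 45, 19, 30],
    "sex": ["Female", "Male", "Male", "Female"],
})

under_20 = toy.age < 20
female = toy.sex == "Female"

both = toy[under_20 & female]      # AND: both conditions hold
either = toy[under_20 | female]    # OR: at least one condition holds
not_female = toy[~female]          # NOT: invert a condition

print(len(both), len(either), len(not_female))
```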

Try it yourself. Work with the person next to you to create your own filter. Be prepared to share your filter.

In [None]:
# Create your own filter here:

### Optional: More Exploratory Data Analysis (EDA)

Often when working with a Pandas dataframe, you will have a column of categorical (string) values and want the frequency count of each unique value in that column. Pandas’ `.value_counts()` method gives you those counts easily.

In [None]:
# Find value counts here
census['education'].value_counts()

#### Q. How many folks have a 9th grade education?

In [None]:
# Write your answer here

Let's try finding the number of individuals by country.

In [None]:
# Find the value counts of the native country.
census['native_country'].value_counts()

The `.groupby()` method is a very powerful Pandas method. 
You can group by one column and count the values of another column per this column value using the `.value_counts()` method. 

In [None]:
# Provides a value count for marital_status by country
census.groupby('native_country')['marital_status'].value_counts() 

You can use the groupby function in combination with statistical functions as well:

In [None]:
# For each gender, calculate the mean of the numeric columns in the dataframe.
census.groupby('sex').mean(numeric_only=True) # numeric_only skips text columns, which cannot be averaged
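
A group-by splits the rows into groups, applies a calculation to each group, and combines the results. The following sketch shows this on a made-up four-row dataframe (the values are invented for illustration) so the averages are easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "age": [30, 40, 50, 20],
    "hours_per_week": [40, 50, 20, 30],
})

# Split rows by 'sex', then average the selected numeric columns within each group.
means = toy.groupby("sex")[["age", "hours_per_week"]].mean()
print(means)
```

The result has one row per group, and the group labels become the index.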

Try grouping by age and find the mean of numeric columns in the dataframe. Do you notice any trends?

In [None]:
# Use groupby age and find mean of numeric columns in dataframe.

### Optional: Basic Data Visualization

We'll now take a look at the Matplotlib package for visualization in Python. Matplotlib is a multi-platform data visualization library built on NumPy arrays. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. 

We can use the `.plot()` function to specify the type of chart we want.

The first type of visualization we can do is a bar chart, which is best for showing numerical comparisons across different categories. 

In [None]:
# Count the number of individuals at each education level
census['education'].value_counts()

In [None]:
# Compare with bar plot
census['education'].value_counts().plot(kind='bar');

Scatter plots can be used to show the relationship between two numerical variables. We can compare age and hours per week:

In [None]:
census.plot(kind='scatter', x='age', y='hours_per_week', alpha=0.1); # alpha variable controls transparency

A histogram plot is generally used to summarize the distribution of a data sample. 

In [None]:
ax = census.age.plot(kind='hist', bins=20, title='Histogram of Age'); # Creates the histogram
ax.set_xlabel('Age'); # Sets the label for the x axis
ax.set_ylabel('Frequency'); # Sets the label for the y axis

### NOW IT'S YOUR TURN

Now that you've walked through the basics of exploratory data analysis, spend time exploring the data and doing your own analysis.

Using the cheat sheets, your facilitators, and help from the other participants, try to answer some of the following questions or create your own:

* Find the average capital_gain and capital_loss.
* Find only those individuals who have a Bachelors and are never married. (*Hint*: use filtering methods)
* Optional: Visualize the number of individuals by relationship. (*Hint*: use a bar chart)

# Part 2: Machine Learning and Predictive Analysis

What kind of predictive analysis can we do on this data? Which countries have incomes below or above $50k? 

Machine learning is a technique that uses the past to guess at what the future will hold. We can apply it here in a variety of ways. There are two main types of machine learning: supervised (where the target is labeled) and unsupervised (where the target is not labeled). We'll work primarily with supervised machine learning for this data set. Beyond that, there are several main tasks for the machines, such as categorizing data based on its features or predicting future values.

![Python's ML package, Scikit-learn, has its own cheat sheet](http://scikit-learn.org/stable/_static/ml_map.png)
http://scikit-learn.org/stable/_static/ml_map.png

You may have seen while doing your exploratory analysis that some trends are clearer than others. There's always a fuzzy area, and machine learning works to handle that ambiguity in ways that humans are not so great at. Of course, there are plenty of legal and ethical questions surrounding the use of machine learning, as input biases lead to output biases and creating fairness in artificial intelligence is a hot topic, but for now we'll use machines to see if we can draw any new insights from our data.

We will be using **Scikit-learn**, a Python library for building machine learning models. Scikit-learn provides a range of supervised and unsupervised learning algorithms.

In [None]:
# Import the splitter and the machine learning (ML) algorithms we will use.
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.svm import LinearSVC, SVC
from sklearn import metrics
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn import tree
# Note: the older sklearn.cross_validation module has been replaced by sklearn.model_selection.
%matplotlib inline

## Optional: Viewing the Data

Before going into machine learning, we have to do some more exploratory data analysis to see which features make a difference to the outcome. Where correlations are weak, we're likely to see poor performance, whereas stronger correlations mean better performance. Below, we'll view the data two different ways -- using a pair plot and a correlation heatmap.

In [None]:
census.columns

Let's see the relationship of our target variable (income) to each other variable. We are using the Seaborn library to visualize this.

In [None]:
# Relationship of our target variable to each other variable:
sns.set(style="ticks")
sns.pairplot(census, hue="income");

In [None]:
# Build a correlation matrix (.corr() method). 
# Need to know which features are/aren't correlated with each other. 
# Goal is to feed uncorrelated features into the model.

corr = census.select_dtypes(include='number').corr() # correlation is only defined for numeric columns
sns.heatmap(corr, annot = True); #in seaborn, put a ; at the end to remove that funny line
#annot = True means to place the values in each square.

## Preparing the Data
Some models only work with numeric data. That means we must convert all the words (categorical data) to numbers in order to feed them into our models. 

Let's get to it. 

In [None]:
census = pd.read_csv("datasets/census.csv")
census.columns = census.columns.str.strip()
census.columns = census.columns.str.replace('-','_')
census.dtypes
census.head()

We can use Pandas' `pd.get_dummies()` to convert categorical variables into dummy/indicator variables.

In [None]:
# Extract all the categorical (non-numerical) columns into a new dataframe called "dummy_columns"
dummy_columns = census[['work_class', 'income', 'marital_status', 'race', 'sex', 'relationship', 'native_country', 'education', 'occupation']]

In [None]:
dummy_columns.head()

Now, let's make a dataframe of our **numeric** columns. We'll merge this onto the dummy'd columns

In [None]:
numeric_columns = census[['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']]

In [None]:
# Here we concatenate the numeric columns to the dummied column, and drop the first one.
# This new dataframe is called "df"
df = pd.concat([numeric_columns, pd.get_dummies(dummy_columns, dummy_na=True, drop_first=True)], axis = 1)  

In [None]:
df.head()

In [None]:
df.columns
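
To see what dummy encoding and `drop_first=True` do, here is a minimal sketch on a made-up dataframe. The category labels `A`, `B`, and `C` are placeholders for illustration, not values from the census data.

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["Male", "Female", "Female"], "race": ["A", "B", "C"]})

full = pd.get_dummies(toy)                      # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first category of each column

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first category loses no information: a row with zeros in every remaining indicator for a column must belong to the dropped category. It also avoids feeding the model perfectly redundant columns.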

Before we continue, let's define a few key terms:

**Features** are individual independent variables that act as the input in our model. The number of features is called the dimensionality of the data. 
* We will define this as **X**.

**Target** is the output we want the model to produce from the input variables. In a classification problem it is the class each observation belongs to; in a regression problem it is a numeric value. In the training set, the target holds the known outcomes the model learns from.
Simply put, it is the thing we are trying to predict. 
* We will define this as **y**.

In [None]:
# First, define your target variable.
y = df['income_ >50K']

In [None]:
# Next, define all of the features you'll use in your model. Always drop the y variable.
X = df.drop(['income_ >50K', 'income_nan'], axis=1)

## Logistic Regression

This is one of the most commonly used models. Logistic regression allows us **to predict classes**, thus it is a type of *classifier*. In this case, we want to predict whether or not an individual's income is greater than 50k.

First, we must split our data into a **training set** and **testing set**.
* **training set**: the data that the algorithm will "learn" from. Includes the outcome (y variable).
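* **testing set**: sometimes loosely called the *validation set*, this is data held back from training and used to measure *how well* the model makes predictions on examples it has not seen. The model is **not** shown the outcome for these rows; we keep it aside and compare it with the model's predictions.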


Figuring out how much of your data should be split into your test set is a tricky question. If your training set is too small, then your algorithm might not have enough data to effectively learn. On the other hand, if your test set is too small, then your accuracy and precision could have a large variance. You might happen to get a really lucky or a really unlucky split! In general, putting 80% of your data in the training set, and 20% of your data in the test set is a good place to start.

Using the `train_test_split` function, we generate 4 objects (two dataframes of features and two series of labels):
* `X_train` is the training data set
* `y_train` is the set of labels to all the data in `X_train`
* `X_test` is the test data set
* `y_test` is the set of labels to all the data in `X_test`


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
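
To check what a split produces, the sketch below runs `train_test_split` on ten made-up rows (the `feature` and `label` names are assumptions for illustration). With `test_size=0.2`, two rows go to the test set and eight to the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series(np.arange(10) % 2, name="label")

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)
# Features and labels are shuffled together, so their row labels still match.
print((X_tr.index == y_tr.index).all())
```

Setting `random_state` makes the shuffle repeatable, so everyone running the cell gets the same split.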

### Let's fit!
Now it's time to fit our model on the **training data** using the `fit(X, y)` method.

In [None]:
lr = LogisticRegression()

In [None]:
lr.fit(X_train, y_train) 

### Let's get a score! 

Now let's score our model to see how well it performed.

In [None]:
print("Training set accuracy: {:.3f}".format(lr.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(lr.score(X_test, y_test)))
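
For a classifier, `.score()` reports accuracy: the fraction of predictions that match the actual labels. The sketch below computes that by hand on made-up labels, which are invented for illustration and are not model output.

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0])   # made-up actual labels
y_hat = np.array([1, 0, 1, 1, 0])    # made-up predictions

# Comparing the arrays gives True/False per row; the mean of that is the accuracy.
accuracy = (y_true == y_hat).mean()
print(accuracy)
```

Four of the five predictions match, so the accuracy is 0.8.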

Congratulations, you've built a model. This here is machine learning!

### Let's make predictions!

The `predict(X)` method, given unlabeled observations X (test data set) returns the predicted labels y.

In [None]:
# Predict outcomes
y_pred_class = lr.predict(X_test)

The resulting data is an array. We can convert this into a dataframe using the `pd.DataFrame()` method. 

In [None]:
outcomes = pd.DataFrame(y_pred_class)

In [None]:
outcomes

Show a comparison of our predicted outcomes to the actual values in a dataframe.

In [None]:
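# Keep only the actual values for the test rows, and reset the index so rows line up with `outcomes`
y_actual = pd.DataFrame(y_test).reset_index(drop=True)

In [None]:
# Place predictions and actual values side by side
preds_df = pd.concat([outcomes, y_actual], axis=1)
preds_df.columns = ['predicted', 'actual']
preds_df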
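
One compact way to compare predictions with actual values is a cross-tabulation, which counts how often each combination occurs. This is a sketch on made-up labels (not the tutorial's results), using `pd.crosstab`.

```python
import pandas as pd

actual = pd.Series([1, 0, 0, 1, 0, 1], name="actual")
predicted = pd.Series([1, 0, 1, 1, 0, 0], name="predicted")

# Rows are the actual classes, columns are the predicted classes.
table = pd.crosstab(actual, predicted)
print(table)
```

Counts on the diagonal are correct predictions; counts off the diagonal are mistakes.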

## Optional: Which features are the most important in the model?

In [None]:
#Plot Logistic regression coefficients

plt.figure(figsize=(20,8))
plt.plot(lr.coef_.T, 'o', label="C=1")
#plt.plot(logreg100.coef_.T, '^', label="C=100")
#plt.plot(logreg001.coef_.T, 'v', label="C=0.001")
plt.xticks(range(X.shape[1]), X.columns, rotation=90) # label each tick with its feature name
plt.hlines(0, 0, X.shape[1])
#plt.ylim(-5, 5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.savefig('log_coef') # Creates a plot and saves the figure in your working directory

## Optional: Random Forests

Decision trees are a type of model used for both classification and regression. Trees answer sequential questions which send us down a certain route of the tree given the answer. The model behaves with “if this then that” conditions, ultimately yielding a specific result.

A Random Forest, a classifier, is simply a collection of decision trees whose results are aggregated into one final result. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models.

In [None]:
rf = RandomForestClassifier()

# Now let's fit our model
rf.fit(X_train, y_train)

In [None]:
# Let's score our random forest model. Looks like this model is ... overfit (dun dun dun)
print("Training set accuracy: {:.3f}".format(rf.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(rf.score(X_test, y_test)))

In [None]:
# Which features mattered most when fitting on the training set?
rf.feature_importances_
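
On its own, `feature_importances_` is an unlabeled array of numbers. Below is a minimal sketch that pairs each value with its column name and sorts them. The data is randomly generated with made-up columns `signal` and `noise`, and the label is built to depend only on `signal`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    "signal": rng.rand(200),
    "noise": rng.rand(200),
})
y_toy = (X_toy["signal"] > 0.5).astype(int)  # the label depends only on 'signal'

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Attach the column names so each importance value is readable.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1, and the column that actually drives the label should rank first. The same pattern should work with `rf` and `X.columns` from the cells above.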

### Optional: Your Turn:

Look into various implementations of tree models and give them a try. Do any improve the quality of the predictions? Also consider removing features or using different feature combinations to optimize your results. Can you get above an 80% accuracy? What happens when you change the test/train split? There are many parameters that can be changed when performing machine learning. Which ones give you the best results?