## Exercises

### Data Acquisition & Summary


1. Use seaborn to load the iris data set into a dataframe, `df_iris`

    - print the first 3 rows 
    
    - print the number of rows and columns (shape)
    
    - print the column names
    
    - print the data type of each column
    
    - print the summary statistics for each of the numeric variables.  Would you recommend rescaling the data based on these statistics?
    
    
2. Read the data tab from the stats module dataset, Excel_Stats.xlsx, into a dataframe, `df_excel`
    
    - assign the first 100 rows to a new dataframe, `df_excel_sample`
    
    - print the number of rows of your original dataframe
    
    - print the first 5 column names
    
    - print the column names that have a data type of `object`
    
    - compute the range for each of the numeric variables.
    
    
3. Read train.csv from Google Drive (shared through classroom in topic 'Classification') into a dataframe labeled `df_google`
    
    - print the first 3 rows 
    
    - print the number of rows and columns
    
    - print the column names
    
    - print the data type of each column
    
    - print the summary statistics for each of the numeric variables
    
    - print the unique values for each of your categorical variables


4. In MySQL Workbench or a terminal, write a query to select all the columns of titanic_db.passengers. Export that table to a CSV file that you store locally.  Read that CSV into a dataframe `df_csv`. 

    - print the head and tail of your new dataframe
    
    - print the number of rows and columns 
    
    - print the column names
    
    - print the data type of each column
    
    - print the summary statistics for each numeric variable
    
    - print the unique values for each categorical variable.  If there are more than 5 distinct values, print the top 5 in terms of prevalence or frequency. 
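
The inspection steps in the exercises above follow the same pattern for every dataframe, so a minimal sketch may help. The small dataframe here is a hypothetical stand-in (its column names and values are invented for illustration); in practice you would run the same calls on `df_excel`, `df_google`, or `df_csv`.

```python
import pandas as pd

# Hypothetical stand-in data; replace with your own dataframe.
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27, 14],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "embarked": ["S", "C", "S", "S", "S", "S", "S", "C"],
    "deck": ["A", "B", "C", "D", "E", "F", "G", "A"],
})

print(df.head(3))            # first 3 rows
print(df.shape)              # (number of rows, number of columns)
print(df.columns.tolist())   # column names
print(df.dtypes)             # data type of each column
print(df.describe())         # summary statistics for the numeric columns

# Range (max - min) of each numeric column
numeric = df.select_dtypes(include="number")
value_range = numeric.max() - numeric.min()
print(value_range)

# Non-numeric columns (dtype `object` in most pandas versions)
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Unique values per categorical column; top 5 by frequency if more than 5
top_values = {}
for col in categorical_cols:
    if df[col].nunique() > 5:
        top_values[col] = df[col].value_counts().head(5)
    else:
        top_values[col] = list(df[col].unique())
    print(col, top_values[col], sep="\n")
```

`value_counts()` sorts by frequency in descending order, so `.head(5)` gives the five most common values.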

### Using df_iris

Scenario: A local flower shop is trying to identify the species of the irises that are delivered to their shop to sell.  The tags identifying the species are missing.  They want to both *understand* the differences between these species (exploratory analysis stage) and have a model that will label the species of the plants that arrived unlabeled.  

#### Data Preparation

Compute 2 new variables, `sepal_area` and `petal_area`, add them to your existing columns, and assign the result to a new dataframe, `df_with_area`.
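
One possible way to do this is sketched below. It assumes that "area" means length multiplied by width (a bounding-rectangle approximation, which the exercise does not spell out), and it uses three hypothetical stand-in rows in place of the full `df_iris`.

```python
import pandas as pd

# Hypothetical stand-in rows with the iris column names; use your df_iris in practice.
df_iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Assumption: area = length * width.
# assign() returns a new dataframe and leaves df_iris unchanged.
df_with_area = df_iris.assign(
    sepal_area=df_iris["sepal_length"] * df_iris["sepal_width"],
    petal_area=df_iris["petal_length"] * df_iris["petal_width"],
)
print(df_with_area)
```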


#### Data Exploration

1. Split data into train (70%) & test (30%) samples.  You should end with 2 data frames: `train_df` and `test_df`

> `from sklearn.model_selection import train_test_split   
    train_df, test_df = train_test_split(df_with_area, test_size = .30,  
                                random_state = 123,    
                                stratify = df_with_area[['species']])`   
                               
                               
2. Create a swarmplot where the x-axis is each of the independent variable names (petal_length, petal_width, etc).  The y-axis is the value of the variable.  Use color to represent species as another dimension.  Hint: You will need to 'melt' the dataframe into a 'long' dataframe in order to accomplish this.  What are your takeaways from this visualization?  

> `sns.set(style="whitegrid", palette="muted")   
    train_melt = pd.melt(train_df, "species", var_name="measurement")   
    plt.figure(figsize=(8,6))  
    sns.swarmplot(x="measurement", y="value", hue="species",   
        palette=["r", "c", "y"], data=train_melt)`   
              

> We can see a clear separation within both petal length and petal width for the 3 species. Sepal length separates setosa from the other two, but virginica and versicolor overlap considerably.  If we calculate sepal area and petal area, maybe we can find more space between the 3 species.


2. Create 4 subplots (2 rows x 2 columns) of scatterplots 
    - sepal_length x sepal_width 
    - petal_length x petal_width
    - sepal_area x petal_area
    - sepal_length x petal_length
    - Make your figure size 14 x 8.  What are your takeaways?


> `plt.figure(figsize=(14,8))  
    plt.subplot(2,2,1)  
    sns.scatterplot(x='sepal_length', y='sepal_width', hue='species',   
        palette=["r", "c", "y"], data=train_df)     
    plt.title('Sepal Length & Width => Species?')`  

> `plt.subplot(2,2,2)   
    sns.scatterplot(x='petal_length', y='petal_width', hue='species',  
       palette=["r", "c", "y"], data=train_df)  
    plt.title('Petal Length & Width => Species?')`   

> `plt.subplot(2,2,3)  
    sns.scatterplot(x='sepal_area', y='petal_area', hue='species',    
        palette=["r", "c", "y"], data=train_df)  
    plt.title('Area of Sepals & Petals  => Species?')`   

> `plt.subplot(2,2,4)   
    sns.scatterplot(x='sepal_length', y='petal_length', hue='species',   
        palette=["r", "c", "y"], data=train_df)   
    plt.title('Length of Sepals & Petals => Species?')`   


3. Create a heatmap of the correlations between the variables, annotating each cell with its correlation coefficient.  

> `plt.figure(figsize=(8,6))   
    sns.heatmap(train_df.drop(columns='species').corr(), cmap='Blues', annot=True)`  


4. Create a scatter matrix visualizing the interaction of each variable 

> `from pandas.plotting import scatter_matrix   
    species_codes = train_df['species'].astype('category').cat.codes   
    scatter = scatter_matrix(train_df, c=species_codes, marker = 'o', s=40,  
        hist_kwds={'bins':15}, figsize=(9,9), cmap='gnuplot')`


5. Is the sepal length significantly different in virginica than versicolor? Run an experiment to test this. 

    - must include null hypothesis, alternative hypothesis, t-test, results, and summary 
    
    - H0: there is no difference in mean sepal length between virginica and versicolor.
    
    - Ha: there is a difference in mean sepal length between virginica and versicolor.
    
    - We will test whether the sepal length of virginica is significantly different from that of versicolor.
    
    - If there is a difference, then the variable `sepal_length` is a good choice to keep as a feature.
    
    - We can use a t-test here, as sepal_length is somewhat normally distributed.    


> `from scipy import stats    
    virginica = train_df[train_df['species']=='virginica']['sepal_length'].dropna()   
    versicolor = train_df[train_df['species']=='versicolor']['sepal_length'].dropna()    
    stats.ttest_ind(virginica, versicolor)`  
                   
> We reject the null hypothesis that there is no difference in `sepal_length` between the versicolor and virginica species.  Therefore, it is a good idea to keep `sepal_length` as a feature. 
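
The "results" and "summary" parts of the experiment come down to comparing the p-value with a significance level chosen in advance. The sketch below illustrates that decision step only. The two samples are synthetic (generated, not real iris measurements), and the significance level `alpha = 0.05` is an assumption rather than something the exercise prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Synthetic stand-ins for the two sepal_length samples (not real iris values).
virginica = rng.normal(loc=6.6, scale=0.6, size=35)
versicolor = rng.normal(loc=5.9, scale=0.5, size=35)

alpha = 0.05  # assumed significance level
t_stat, p_value = stats.ttest_ind(virginica, versicolor)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the mean sepal lengths differ.")
else:
    print("Fail to reject H0: no significant difference detected.")
```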



#### Data Modeling

##### Logistic Regression

2. Fit the logistic regression classifier to your training sample and transform, i.e. make predictions on the training sample
3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
5. Look in the scikit-learn documentation to research the `solver` parameter.  What are your best options for the particular problem you are trying to solve and the data to be used? 
6. Run through steps 2-4 using another `solver` (from question 5) 
7. Which performs better on your in-sample data?
8. Save the best model in `logit_fit`
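
Steps 2-4 can be sketched roughly as follows. This is one possible approach, not the only one: it assumes scikit-learn's bundled copy of the iris data as a stand-in for `train_df`, picks `solver="lbfgs"` only as a starting point, and introduces a hypothetical helper, `per_class_rates`, that treats each species one-vs-rest to get the four rates from a multiclass confusion matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for train_df.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)

logit = LogisticRegression(solver="lbfgs", max_iter=1000)
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)  # in-sample predictions

print("Accuracy:", logit.score(X_train, y_train))
cm = confusion_matrix(y_train, y_pred)  # rows = actual, columns = predicted
print(cm)
print(classification_report(y_train, y_pred))  # precision, recall, f1, support


def per_class_rates(cm):
    """One-vs-rest TPR, FPR, TNR, FNR for each class (hypothetical helper)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp  # actual class k, predicted as another class
    fp = cm.sum(axis=0) - tp  # predicted class k, actually another class
    tn = cm.sum() - tp - fn - fp
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (fn + tp),
    }


for name, values in per_class_rates(cm).items():
    print(name, np.round(values, 3))
```

Each array holds one value per species, in the same order as the rows of the confusion matrix.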

##### Decision Tree

2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample) 
3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
5. Run through steps 2-4 using entropy as your measure of impurity. 
7. Which performs better on your in-sample data?
8. Save the best model in `tree_fit`
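
Comparing the two impurity measures might look like the sketch below. It again assumes scikit-learn's bundled iris data in place of `train_df`, and `max_depth=3` is an assumed setting added so the comparison is informative (an unconstrained tree tends to fit the training sample almost perfectly).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for train_df.
X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)

tree_scores = {}
for criterion in ("gini", "entropy"):
    # max_depth=3 is an assumed setting, not part of the exercise
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=123)
    tree.fit(X_train, y_train)
    tree_scores[criterion] = tree.score(X_train, y_train)
    print(f"{criterion}: in-sample accuracy = {tree_scores[criterion]:.3f}")
```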

##### KNN

2. Fit the K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample) 
3. Evaluate your results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
5. Run through steps 2-4 setting k to 10
6. Run through steps 2-4 setting k to 20
7. What are the differences in the evaluation metrics?  Which performs better on your in-sample data? Why?
8. Save the best model in `knn_fit`
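
A loop over the values of k keeps the three runs comparable. The sketch below assumes scikit-learn's bundled iris data as a stand-in for `train_df`, and uses k = 5 for the first run because that is the classifier's default `n_neighbors`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for train_df.
X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)

knn_scores = {}
for k in (5, 10, 20):  # 5 is the scikit-learn default for n_neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    knn_scores[k] = knn.score(X_train, y_train)
    print(f"k = {k:2d}: in-sample accuracy = {knn_scores[k]:.3f}")
```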

##### Random Forest

2. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 20.  
3. Evaluate your results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
6. Run through steps 2-4 again, increasing your min_samples_leaf to 5 and decreasing your max_depth to 3.  
7. What are the differences in the evaluation metrics?  Which performs better on your in-sample data?  Why?  
8. Save the best model in `forest_fit`
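
The two configurations can be run side by side, as in this sketch. It assumes scikit-learn's bundled iris data in place of `train_df` and an arbitrary `random_state` of 123.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for train_df.
X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)

settings = [
    {"min_samples_leaf": 1, "max_depth": 20},
    {"min_samples_leaf": 5, "max_depth": 3},
]

forest_scores = []
for params in settings:
    forest = RandomForestClassifier(random_state=123, **params)
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_train)  # in-sample predictions
    forest_scores.append(forest.score(X_train, y_train))
    print(params, f"accuracy = {forest_scores[-1]:.3f}")
    print(confusion_matrix(y_train, y_pred))
```

Larger leaves and shallower trees constrain each tree, so the second configuration would usually be expected to fit the training sample less closely than the first.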

##### Test

Once you have determined which algorithm (with hyperparameters) performs the best, try reducing the number of features to the top 4 in terms of the information gained from each feature individually. That is, how close do we get to accurately predicting the species with each feature on its own?  

1. Compute the information gained. 
2. Create a new dataframe with top 4 features (`train_df_reduced`).  
3. Use the top-performing algorithm with the hyperparameters used in that model. Create the object, fit, transform on in-sample data, and evaluate the results.  Compare your evaluation metrics with those from the original model (with all the features).  Select the best model. 
4. Run your final model on your out-of-sample dataframe (`test_df`). Evaluate the results.  
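
Steps 1 and 2 could be approached as in the sketch below. It uses `mutual_info_classif` from scikit-learn as an estimate of the information each feature carries about the species, which is an assumption about what "information gained" means here; other measures are possible. The data is scikit-learn's bundled iris set, with columns renamed to match the names used above, and the area features assume area = length * width.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for df_with_area.
iris = load_iris(as_frame=True)
features = iris.data.copy()
features.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
features["sepal_area"] = features["sepal_length"] * features["sepal_width"]
features["petal_area"] = features["petal_length"] * features["petal_width"]

X_train, X_test, y_train, y_test = train_test_split(
    features, iris.target, test_size=0.30, random_state=123, stratify=iris.target
)

# Mutual information between each feature and the species label,
# computed on the training sample only.
info_gain = pd.Series(
    mutual_info_classif(X_train, y_train, random_state=123),
    index=X_train.columns,
).sort_values(ascending=False)
print(info_gain)

# Keep the 4 highest-scoring features
top_features = info_gain.head(4).index.tolist()
train_df_reduced = X_train[top_features]
print(train_df_reduced.head())
```

The same `top_features` list should be applied to `test_df` before the final out-of-sample evaluation, so that the model sees the same columns it was trained on.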