# Mice Protein Expression Data Set
## UCI Machine Learning Repository
## Center for Machine Learning and Intelligent Systems

### Abstract

<div style="text-align: justify; LINE-HEIGHT:20px"> The data file ‘Data_Cortex_Nuclear.xls’ was imported into the interactive shell IPython as the data frame 'proteinExpression'. The libraries pandas, matplotlib and numpy were imported as pd, plt and np respectively. The data was then checked for type, sample rows, column names and size (found to be 1080 by 82). Data cleaning began by selecting 11 target protein expression attributes, which had previously been found to correlate highly with learning outcomes. The two-valued categorical labels Treatment, Genotype and Behavior were then replaced with the binary values 0 or 1, and the attribute Class was regenerated from these binary labels to produce 8 unique integer Class labels. A scatter matrix of the uncleaned data was checked to provide direction for analysis, followed by removal of outliers and NaN values. Outliers were identified as values falling outside a 99th percentile range (more than three standard deviations from the column mean) in any raw protein expression column. During outlier analysis mouse 3484 was identified as having over 60 outlier values, and was removed from the data set. The remaining outliers and NaN values were then replaced with the corresponding Class mean.<br><br> Following cleaning, the data was visualised through a variety of plots to help gain understanding and insight. The data was then modelled with a K-Nearest Neighbour Classification and a Decision Tree Model. The K-Nearest Model worked best with 5 neighbours, distance weighting and Manhattan distance (p = 1). The Decision Tree Models had an average accuracy ranging from 0.79 to 0.98, indicating strong predictive capability at the upper end of that range.</div>

### Introduction

<div style="text-align: justify; LINE-HEIGHT:20px"> Add text </div>

Add table 1

Add figure 1

### Aim

<div style="text-align: justify; LINE-HEIGHT:20px"> Add text </div>

### Data Importation and Exploration

<div style="text-align: justify; LINE-HEIGHT:20px">The data analysis toolkit pandas (McKinney 2010), the scientific computing package numpy, the 2D plotting library matplotlib (Hunter 2007) and urllib2, the core Python module for opening URLs, were imported into IPython (Perez & Granger 2007) as pd, np, plt and urllib2 respectively. The Excel file ‘Data_Cortex_Nuclear.xls’ was imported into IPython from UCI’s machine learning repository (http://archive.ics.uci.edu/ml/) and named “proteinExpression”.</div>

In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import urllib2

In [None]:
# Importing the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00342/Data_Cortex_Nuclear.xls"
# header=0 takes the column names from the first row of the sheet
proteinExpression = pd.read_excel(urllib2.urlopen(url), header=0)

In [None]:
# Checking the column (attribute) names - exploring data with a list
columnIDlist = proteinExpression.columns.tolist()
print columnIDlist

In [None]:
# Confirming the data types (exploration) - series created and counted
nativeDataTypes = proteinExpression.dtypes
countDataTypes = nativeDataTypes.value_counts()
print countDataTypes

In [None]:
# Checking the value counts of categorical columns
proteinExpression[['Genotype', 'Treatment', 'Behavior', 'class']].apply(pd.Series.value_counts)

In [None]:
# Rename class column with uppercase 
proteinExpression.rename(columns = {'class': 'Class'}, inplace = True)

<div style="text-align: justify; LINE-HEIGHT:20px">The method .describe() was used to check that the data had been imported successfully in the previous step. The output was as expected, containing the columns MouseID, Protein Expression (77 columns), Genotype, Treatment, Behavior and class, as described in the literature (Higuera 2015). The output also gave the size of the data [1080 rows × 82 columns], which was consistent with the description of the “Mice Protein Expression Data Set” (Higuera 2015). Due to the size of the data set some columns and rows were not displayed, so the attribute .columns was used to check that the attributes had been imported as expected. The attribute .dtypes was then used to check that the data had been imported with the correct types. The object-typed columns Genotype, Treatment, Behavior and class were checked with .value_counts() to look for missing values and typos and to assess whether they were Boolean, with the results in table 2.</div>

In [None]:
# TODO: clean up text

<div style="text-align: justify; LINE-HEIGHT:20px">The output above shows no missing values and no typos, and shows that Genotype, Treatment and Behavior are Boolean (each takes exactly two values). As noted in table 2, the MouseID takes the form tag_n, where n is the measurement number. It was noted that the column name class began with a lower-case c, which was replaced with an upper-case C and then checked, as below.</div>
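
The tag_n form of MouseID can be separated into its two parts. The following is a minimal sketch only: the three IDs are made-up examples in the tag_n form, and the column names `Tag` and `Measurement` are illustrative names that are not used elsewhere in this analysis.

```python
import pandas as pd

# Hypothetical MouseID values in the tag_n form (illustrative only)
exampleIDs = pd.Series(['309_1', '309_2', '3484_15'])

# Split on the final underscore: the left part is the mouse tag, the right part is n
parts = exampleIDs.str.rsplit('_', n=1)
splitIDs = pd.DataFrame({'Tag': parts.str[0],
                         'Measurement': parts.str[1].astype(int)})
```

Splitting from the right keeps the tag intact even if a tag were itself to contain an underscore.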

<div style="text-align: justify; LINE-HEIGHT:20px"> Add text </div>
<div style="text-align: justify; LINE-HEIGHT:20px"> Add text </div>
<div style="text-align: justify; LINE-HEIGHT:20px"> Add text </div>

In [None]:
# Create new data set with desired proteins
targetProteins = proteinExpression[['MouseID', 'BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                        'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 'ITSN1_N', 'SOD1_N', 
                        'GFAP_N', 'Genotype', 'Treatment', 'Behavior', 'Class']]

In [None]:
# Stats check
targetProteinsStats = targetProteins.describe()
print targetProteinsStats

In [None]:
# Checking for missing values
missingValueCheck = targetProteins[['BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                        'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 
                        'ITSN1_N', 'SOD1_N', 'GFAP_N']].isnull()
missingValueList = missingValueCheck.apply(pd.Series.value_counts)
print missingValueList

In [None]:
# Housekeeping
pd.options.mode.chained_assignment = None

In [None]:
# Removing string categories: map each two-valued label to the integer 1 or 0
# (astype(int) fails loudly if any label was left unmapped)
targetProteins['Genotype'] = targetProteins['Genotype'].map({'Control': 1, 'Ts65Dn': 0}).astype(int)
targetProteins['Behavior'] = targetProteins['Behavior'].map({'C/S': 1, 'S/C': 0}).astype(int)
targetProteins['Treatment'] = targetProteins['Treatment'].map({'Saline': 1, 'Memantine': 0}).astype(int)

In [None]:
# Confirming alteration
targetProteins[['Behavior', 'Genotype', 'Treatment']].apply(pd.Series.value_counts)

In [None]:
# Converting binary to unique identifying numbers
def change_class(row):
    row['Class'] = row['Genotype'] * 4 + row['Behavior'] * 2 + row['Treatment']
    return row

In [None]:
# Applying above change to target proteins data
targetProteins = targetProteins.apply(change_class, axis=1)

In [None]:
# Checking value counts
targetProteins['Class'].value_counts()
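
The encoding in change_class treats the three binary labels as the bits of a 3-bit integer, which is why it yields 8 distinct Class values. The short check below enumerates every combination; the helper name `encode_class` is illustrative and is not part of the analysis.

```python
from itertools import product

def encode_class(genotype, behavior, treatment):
    # Same weighting as change_class: Genotype is the 4s bit,
    # Behavior the 2s bit and Treatment the 1s bit
    return genotype * 4 + behavior * 2 + treatment

# Enumerate all 2 x 2 x 2 combinations of the binary labels
codes = {combo: encode_class(*combo) for combo in product([0, 1], repeat=3)}
```

With the 0/1 mapping used earlier, a Control mouse (1) in the C/S group (1) given saline (1) maps to Class 7, and a Ts65Dn mouse (0) in the S/C group (0) given memantine (0) maps to Class 0.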

In [None]:
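# Box plot of the 11 target proteins, standardised per column (mean = 0, SD = 1)
# so that outliers are visible on a common scale
boxPlotColumns = ['BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                  'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 'ITSN1_N', 'SOD1_N', 'GFAP_N']
targetProteinsBoxPlot = targetProteins[boxPlotColumns].astype(float)
targetProteinsBoxPlot = (targetProteinsBoxPlot - targetProteinsBoxPlot.mean()) / targetProteinsBoxPlot.std()
targetProteinsBoxPlot.boxplot(figsize=(4,4), vert=False)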

In [None]:
# GIVE TITLE    
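try:
    from pandas.plotting import scatter_matrix        # newer pandas releases
except ImportError:
    from pandas.tools.plotting import scatter_matrix  # older pandas releases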
scatterProteins = targetProteins[['MouseID', 'BRAF_N', 'pERK_N', 'S6_N', 
                                  'pGSK3B_N', 'CaNA_N', 'CDK5_N', 'pNUMB_N', 
                                  'DYRK1A_N', 'ITSN1_N', 'SOD1_N', 'GFAP_N']]
scatter_matrix(scatterProteins, alpha=0.2,figsize=(16,16),diagonal='hist')
plt.show()
# Update diagonal
# Update title
# Alter cell padding / visualisation?

In [None]:
# ADD DESCRIPTION / TITLE
proteinNames = ['BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 'CaNA_N', 'CDK5_N', 'pNUMB_N', 
                'DYRK1A_N', 'ITSN1_N', 'SOD1_N', 'GFAP_N']
# Unique mouse tags: the part of each MouseID before the underscore
miceIDs = proteinExpression['MouseID'].str.split('_').str[0].unique()
MakeMouseID = miceIDs.copy()

In [None]:
# DESCRIPTION + 'Where SD = Standard Deviation'
proteinRows = []
for proteinName in proteinNames:
    count = targetProteins[proteinName].count()
    mean = targetProteins[proteinName].mean()
    sd = targetProteins[proteinName].std()
    minusThreeSD = mean - (3 * sd)
    minusTwoSD = mean - (2 * sd)
    twoSD = mean + (2 * sd)
    threeSD = mean + (3 * sd)
    outliers = targetProteins.query(proteinName + ' < ' + str(minusThreeSD) + 
                                    ' | ' + proteinName + ' > ' + str(threeSD))

    row = {'Protein': proteinName,'Count': count,'Mean': mean,'SD': sd,
           '-3SD': minusThreeSD,'-2SD': minusTwoSD,'+2SD': twoSD,
           '+3SD': threeSD,'Outliers': outliers[proteinName].count()}
    proteinRows.append(row)

nnPctRangeDF = pd.DataFrame(proteinRows, index=proteinNames, 
                            columns=['Count', 'Mean', 'SD', '-3SD', 
                                     '-2SD', '+2SD', '+3SD', 'Outliers'])
# nnPctRangeDF shows the outlier count for each protein

In [None]:
# Add description
print nnPctRangeDF
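
The ±3 SD rule used above can also be expressed without an explicit loop. The sketch below is an alternative formulation, not the method used in this analysis; it runs on a small made-up frame (`demo`, with hypothetical columns `A` and `B`) so that it is self-contained.

```python
import pandas as pd

# Made-up data: 20 ordinary values per column, plus one extreme value in 'A'
demo = pd.DataFrame({'A': [1.0] * 10 + [1.2] * 10 + [50.0],
                     'B': [0.5] * 10 + [0.7] * 11})

mean = demo.mean()
sd = demo.std()  # sample standard deviation, the same default used by Series.std()

# Boolean frame: True where a value lies more than 3 SD from its column mean
isOutlier = (demo < mean - 3 * sd) | (demo > mean + 3 * sd)
outlierCounts = isOutlier.sum()
```

Here `outlierCounts` plays the role of the Outliers column in nnPctRangeDF, with one entry per column of the frame.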

In [None]:
# DESCRIPTION + 'Where SD = Standard Deviation'
outlierMiceRows = []
for proteinName in proteinNames:

    mean = targetProteins[proteinName].mean()
    sd = targetProteins[proteinName].std()
    minusThreeSD = mean - (3 * sd)
    threeSD = mean + (3 * sd)
    outliers = targetProteins.query(proteinName + ' < ' + str(minusThreeSD) + 
                                    ' | ' + proteinName + ' > ' + str(threeSD))

    if outliers.empty:
        row = {'Protein': proteinName,'MouseID': '-','# Instances': '-','Genotype': '-',
           'Treatment': '-','Behavior': '-','Class': '-'}
        outlierMiceRows.append(row)
    else:
        for mouseID in miceIDs:
            # Match on the full tag so one mouse tag cannot match inside another
            mouseOutlierRows = outliers[outliers['MouseID'].str.startswith(mouseID + '_')]
            if not mouseOutlierRows.empty:
                    row = {'Protein': proteinName,'MouseID': mouseID,
                           '# Instances': len(mouseOutlierRows),
                           'Genotype': mouseOutlierRows['Genotype'].iloc[0],
                           'Treatment': mouseOutlierRows['Treatment'].iloc[0],
                           'Behavior': mouseOutlierRows['Behavior'].iloc[0],
                           'Class': mouseOutlierRows['Class'].iloc[0]}
                    outlierMiceRows.append(row)

outliersDF = pd.DataFrame(outlierMiceRows, 
                          columns=['Protein', 'MouseID','# Instances', 
                                   'Genotype','Treatment', 'Behavior', 'Class'])

In [None]:
# Add description
print outliersDF

In [None]:
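# Stripping all outliers back to NaN
# The 3 SD bounds are computed once per protein, before any value is replaced
outlierBounds = {}
for proteinName in proteinNames:
    mean = targetProteins[proteinName].mean()
    sd = targetProteins[proteinName].std()
    outlierBounds[proteinName] = (mean - (3 * sd), mean + (3 * sd))

def make_nans(row):
    for proteinName in proteinNames:
        minusThreeSD, threeSD = outlierBounds[proteinName]
        if row[proteinName] < minusThreeSD or row[proteinName] > threeSD:
            row[proteinName] = np.nan
    # Return only after every protein has been checked
    return row
targetProteins = targetProteins.apply(make_nans, axis=1)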

In [None]:
# Checking for missing values - native outliers stripped to NaN
missingValueCheck = targetProteins[['BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                        'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 
                        'ITSN1_N', 'SOD1_N', 'GFAP_N']].isnull()
missingValueList = missingValueCheck.apply(pd.Series.value_counts)
print missingValueList

Discuss finding above...

In [None]:
# Removal of mouse 3484_n
targetProteins = targetProteins[~targetProteins['MouseID'].str.contains('3484')]
indexOfMouse = np.where(miceIDs=='3484')[0]
miceIDs = np.delete(miceIDs, indexOfMouse)

In [None]:
# Convert all remaining NaNs to the average value of that protein for that class
def make_averages(row):
    for proteinName in proteinNames:
        if np.isnan(row[proteinName]):
            average = targetProteins[targetProteins.Class == row['Class']][proteinName].mean()
            row[proteinName] = average
    return row
targetProteins = targetProteins.apply(make_averages, axis=1)
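
Filling each NaN with the mean of its own Class can also be written with `groupby` and `transform`. This is a sketch of an alternative to the row-wise function above, run on a small made-up frame (`demo`); the values are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Made-up data: two classes, each with one missing BRAF_N value
demo = pd.DataFrame({'Class':  [0, 0, 0, 1, 1, 1],
                     'BRAF_N': [0.2, np.nan, 0.4, 1.0, 3.0, np.nan]})

# For every row, the mean of BRAF_N within that row's Class (NaNs are skipped)
classMeans = demo.groupby('Class')['BRAF_N'].transform('mean')

# Keep observed values and fill only the gaps
demo['BRAF_N'] = demo['BRAF_N'].fillna(classMeans)
```

The same two lines could be repeated for each of the 11 target proteins, which avoids recomputing the Class mean for every row.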

In [None]:
# Checking for missing values
missingValueCheck = targetProteins[['BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                        'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 
                        'ITSN1_N', 'SOD1_N', 'GFAP_N']].isnull()
missingValueList = missingValueCheck.apply(pd.Series.value_counts)
print missingValueList

In [None]:
# ADD TITLE
scatterProteins = targetProteins[['MouseID', 'BRAF_N', 'pERK_N', 'S6_N', 'pGSK3B_N', 
                                  'CaNA_N', 'CDK5_N', 'pNUMB_N', 'DYRK1A_N', 'ITSN1_N', 
                                  'SOD1_N', 'GFAP_N']]
scatter_matrix(scatterProteins, alpha=0.2,figsize=(16,16),diagonal='hist')
plt.show()
# Apply changes as above

In [None]:
# Write this dataframe to a CSV file for later use
targetProteins.to_csv("finalData.csv")