In [1]:
import numpy as np
import pandas as pd
import os
import sklearn
import missingno as msno

# Reading Data

In [2]:
def segmentWords(s): 
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    # The with-block closes the file even if reading raises an error
    with open(fileName) as f:
        return segmentWords(f.read())

#### Create a Dataframe containing the counts of each word in a file

In [28]:
from collections import Counter

d = []

for c in os.listdir("data_training"):
    directory = "data_training/" + c
    for file in os.listdir(directory):
        words = readFile(directory + "/" + file)
        # Count every word in one pass instead of calling words.count() per word
        e = dict(Counter(words))
        e['__FileID__'] = file
        # Label: 1 for the pos directory, 0 for neg, -1 for anything else
        val = -1
        if directory == 'data_training/pos':
            val = 1
        elif directory == 'data_training/neg':
            val = 0
        e['__CLASS__'] = val
        d.append(e)

Create a dataframe from d, and make sure to fill all the NaN values with zeros.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html


In [30]:
df = pd.DataFrame(data=d)
# Just to prove that there are some NaN values in the dataset
print(df.shape)
print(sum(pd.isnull(df["they"])))
# fillna replaces missing values; replace('NaN', 0) would only match the literal string 'NaN'
df = df.fillna(0)

# Count the NaN values left in the whole dataframe (should be 0)
sm = int(df.isnull().sum().sum())

print(sm)
print(df.describe())

(1600, 45673)
447
0
                        earth     goodies          if      ripley  \
count  1600.000000  1600.000000  1600.000000  1600.000000  1600.000000   
mean      0.000625     0.000625     0.000625     0.000625     0.000625   
std       0.025000     0.025000     0.025000     0.025000     0.025000   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     0.000000     0.000000     0.000000     0.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

          suspend        they      white                         \
count  1600.000000  1600.000000  1600.000000  1600.000000  1600.00000   
mean      0.000625     0.000625     0.000625     0.003125     0.00750   
std       0.025000     0.025000     0.025000     0.074958     0.25492   
min       0.000000   

#### Split data into training and validation sets

* Sample 80% of your dataframe to be the training data.

* Let the remaining 20% be the validation data (you can filter out the indices of the original dataframe that weren't selected for the training data).

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
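
The two bullets above can also be sketched directly with DataFrame.sample and DataFrame.drop. This is a minimal illustration on a toy dataframe, not the notebook's own split: the column names, values, and random_state are assumptions made up for the example.

```python
import pandas as pd

# Toy stand-in for the word-count dataframe (hypothetical values).
toy = pd.DataFrame({
    "good": [3, 0, 1, 0, 2, 0, 1, 0, 4, 0],
    "bad": [0, 2, 0, 3, 0, 1, 0, 2, 0, 1],
    "__CLASS__": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Sample 80% of the rows without replacement; random_state makes it repeatable.
train = toy.sample(frac=0.8, random_state=0)

# Drop the sampled index labels so only the remaining 20% is left for validation.
val = toy.drop(train.index)

print(train.shape, val.shape)
```

Because drop works on index labels, this approach assumes the dataframe's index has no duplicate labels.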

In [34]:
def shuffleAndSplit(data):
    # Size of the 80% split; cast to int because slice bounds must be integers
    length = int(.8 * data.shape[0])
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    return (data.iloc[indices[:length]], data.iloc[indices[length:]])

# Shuffle the original dataframe and split it into training
# and validation sets (dfTest holds the 80% training split, despite its name)
dfTest, dfVal = shuffleAndSplit(df)
print(dfTest.shape, dfVal.shape)

((1280, 45673), (320, 45673))




* Split the dataframe for both training and validation data into x and y dataframes - where y contains the labels and x contains the words

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [36]:
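# Training data and labels (the 80% split, named test* in this notebook)
# Note: the string column __FileID__ is still in the x data and must be
# dropped before fitting a model
testData = dfTest.drop('__CLASS__', axis=1)
testLabels = dfTest['__CLASS__']
print(testData.shape, testLabels.shape)
# Validation data and labels
valData = dfVal.drop('__CLASS__', axis=1)
valLabels = dfVal['__CLASS__']
print(valData.shape, valLabels.shape)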

((1280, 45672), (1280,))
((320, 45672), (320,))


# Logistic Regression

#### Basic Logistic Regression
* Use sklearn's linear_model.LogisticRegression() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
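
A minimal sketch of the three steps above, run on made-up data so it is self-contained. The toy word counts, the column names w0..w4, and the random seed are all assumptions; the only detail taken from the notebook is that a string column such as __FileID__ has to be dropped before fitting, since the model needs numeric features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical toy data: 40 documents, 5 word-count columns, alternating labels.
y = pd.Series(np.tile([0, 1], 20), name="__CLASS__")
counts = rng.poisson(1.0, size=(40, 5))
counts[:, 0] += 3 * y.values  # make the first word informative about the label
X = pd.DataFrame(counts, columns=["w0", "w1", "w2", "w3", "w4"])
X["__FileID__"] = ["doc%d.txt" % i for i in range(40)]

# Keep only numeric word-count columns as features.
features = X.drop("__FileID__", axis=1)

# Create the model, fit it, and score it on the same data and labels.
model = LogisticRegression()
model.fit(features, y)
acc = model.score(features, y)
print(acc)
```

Scoring on the data used for fitting gives training accuracy only; comparing it with the score on the validation split is what reveals overfitting.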

#### Changing Parameters

#### Feature Selection
* In the backward stepwise selection method, you can remove coefficients, together with the corresponding x columns, whose values lie more than a chosen distance from the mean coefficient. You decide how far from the mean is reasonable.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.std.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html
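
One possible reading of the rule described above, sketched with numpy.mean, numpy.std, and numpy.where on made-up data. The threshold k (measured in standard deviations), the toy counts, and the column names are assumptions for illustration, not values prescribed by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical toy data: 60 documents, 8 word-count columns.
y = np.tile([0, 1], 30)
X = pd.DataFrame(rng.poisson(1.0, size=(60, 8)),
                 columns=["w%d" % i for i in range(8)])
X["w0"] += 3 * y  # make one word strongly predictive

# Fit once to obtain one coefficient per column.
model = LogisticRegression().fit(X, y)
coefs = model.coef_[0]

# Positions of coefficients more than k standard deviations from the mean.
k = 1.0  # assumed threshold; tune it by checking validation accuracy
far = np.where(np.abs(coefs - np.mean(coefs)) > k * np.std(coefs))[0]

# Drop the matching columns and refit on the reduced feature set.
X_reduced = X.drop(X.columns[far], axis=1)
refit = LogisticRegression().fit(X_reduced, y)
print(X.shape[1], X_reduced.shape[1])
```

Whether removing these columns helps is an empirical question, so the reduced model should be compared with the original on the validation set.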

How did you select which features to remove? Why did that reduce overfitting?

# Single Decision Tree

#### Basic Decision Tree

* Initialize your model as a decision tree with sklearn.
* Fit the data and labels to the model.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
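
A minimal sketch of the two steps above on made-up data, with a held-out slice added so the training and validation scores can be compared. The toy counts, the 80/20 slicing, and random_state are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical toy data: 100 documents, 6 word-count columns.
y = np.tile([0, 1], 50)
X = rng.poisson(1.0, size=(100, 6))
X[:, 0] += 2 * y  # make the first word informative about the label

# Simple 80/20 split by position (the labels alternate, so both classes appear).
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# Initialize the tree with default parameters and fit it.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(train_acc, val_acc, tree.get_depth())
```

With default parameters the tree keeps splitting until its leaves are pure or cannot be split further, so a gap between the two scores is a useful first check for overfitting.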


#### Changing Parameters
* To test which value is optimal for a particular parameter, you can either loop through various values or look into sklearn.model_selection.GridSearchCV.

References:


http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
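
As a sketch of the GridSearchCV option mentioned above, the example below searches over two decision tree parameters with 5-fold cross-validation. The parameter grid, the toy data, and the choice of cv=5 are assumptions picked for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical toy data: 60 documents, 6 word-count columns.
y = np.tile([0, 1], 30)
X = rng.poisson(1.0, size=(60, 6))
X[:, 0] += 2 * y  # make the first word informative about the label

# Candidate values to try for each parameter (assumed, not prescribed).
param_grid = {
    "max_depth": [1, 2, 4, 8],
    "min_samples_leaf": [1, 5],
}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

The per-combination scores are available in search.cv_results_, which can be plotted against max_depth to show where extra depth stops helping.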

How did you choose which parameters to change and what value to give to them? Feel free to show a plot.

Why is a single decision tree so prone to overfitting?

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
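
A minimal sketch of the three steps above on made-up data. The toy counts, n_estimators=50, and random_state are assumptions chosen for the example rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical toy data: 60 documents, 6 word-count columns.
y = np.tile([0, 1], 30)
X = rng.poisson(1.0, size=(60, 6))
X[:, 0] += 2 * y  # make the first word informative about the label

# Create the forest, fit it, and score it on the same data and labels.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
acc = forest.score(X, y)

print(acc, len(forest.estimators_))
```

Each tree in forest.estimators_ is trained on a bootstrap sample of the rows and considers a random subset of features at each split, and the forest combines their predictions.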


#### Changing Parameters

What parameters did you choose to change and why?

How does a random forest classifier prevent overfitting better than a single decision tree?