In [1]:
import numpy as np
import pandas as pd
import os
import sklearn
import missingno as msno

# Reading Data

In [2]:
def segmentWords(s): 
    return s.split()

def readFile(fileName):
    # Function for reading file
    # input: filename as string
    # output: contents of file as list containing single words
    contents = []
    f = open(fileName)
    for line in f:
        contents.append(line)
    f.close()
    result = segmentWords('\n'.join(contents))
    return result

#### Create a Dataframe containing the counts of each word in a file

In [3]:
d = []

for c in os.listdir("data_training"):
    directory = "data_training/" + c
    for file in os.listdir(directory):
        words = readFile(directory + "/" + file)
        e = {x:words.count(x) for x in words}
       # e['__FileID__'] = file
        val = -1
        if directory == 'data_training/pos':
            val = 1
        elif directory == 'data_training/neg':
            val = 0
        e['__CLASS__'] = val
        d.append(e)

Create a dataframe from d - make sure to fill all the nan values with zeros.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html


In [48]:
df = pd.DataFrame(data=d)
# Just to prove that there are some numbers in the dataset
print(df.shape)
print(sum(pd.isnull(df["they"])))
df = df.fillna(0)

sm = 0
for index, row in df.iterrows():
    sm += sum(pd.isnull(row))
    
print(sm)
print(df.describe())

(1600, 45672)
447
0
                        earth     goodies          if      ripley  \
count  1600.000000  1600.000000  1600.000000  1600.000000  1600.000000   
mean      0.000625     0.000625     0.000625     0.000625     0.000625   
std       0.025000     0.025000     0.025000     0.025000     0.025000   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     0.000000     0.000000     0.000000     0.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

          suspend        they      white                         \
count  1600.000000  1600.000000  1600.000000  1600.000000  1600.00000   
mean      0.000625     0.000625     0.000625     0.003125     0.00750   
std       0.025000     0.025000     0.025000     0.074958     0.25492   
min       0.000000   

#### Split data into training and validation set 

* Sample 80% of your dataframe to be the training data

* Let the remaining 20% be the validation data (you can filter out the indicies of the original dataframe that weren't selected for the training data)

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [49]:
def shuffleAndSplit(data):
    length = int(.8 * data.shape[0])
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    return (data.iloc[indices[:length]], data.iloc[indices[length:]])

#Shuffle original dataframe and split into test
# and validation sets
dfTest, dfVal = shuffleAndSplit(df)
dfTest = pd.DataFrame(data=dfTest)
dfVal = pd.DataFrame(data=dfVal)
print(dfTest.shape, dfVal.shape)

(1280, 45672) (320, 45672)


* Split the dataframe for both training and validation data into x and y dataframes - where y contains the labels and x contains the words

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [50]:
#Test data and labels
testData = dfTest.drop('__CLASS__', axis=1)
testLabels = dfTest['__CLASS__']
print(testData.shape, testLabels.shape)
#Validation datasets
valData = dfVal.drop('__CLASS__', axis=1)
valLabels = dfVal['__CLASS__']
print(valData.shape, valLabels.shape)

(1280, 45671) (1280,)
(320, 45671) (320,)


# Logistic Regression

#### Basic Logistic Regression
* Use sklearn's linear_model.LogisticRegression() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [51]:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(penalty='l2')
model.fit(testData, testLabels)
print(model.score(valData, valLabels))

0.84375


#### Changing Parameters

In [52]:
#Lets adjust C for more regularization
model = LogisticRegression(penalty='l2', C=3)
model.fit(testData, testLabels)
print(model.score(valData, valLabels))
model = LogisticRegression(penalty='l2', C=8)
model.fit(testData, testLabels)
print(model.score(valData, valLabels))
#We can see that adding a regularization term does not seem to help
# increase accuracy slightly, indicating that the inaccuracy
# was not caused by

0.84375
0.84375


#### Feature Selection
* In the backward stepsize selection method, you can remove coefficients and the corresponding x columns, where the coefficient is more than a particular amount away from the mean - you can choose how far from the mean is reasonable.

References:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.std.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html

In [54]:
weights = model.coef_[0]
total = sum(weights)
print(len(weights))
norm = np.asarray(weights) / total
bad = [x for x in range(len(norm)) if norm[x] < .00000001]
print(len(bad))

f = open('/home/porrster/Data-Science-Decal-Fall-2017/Assignments/Project2_DSDFa17/badcols', 'w')

model = LogisticRegression(penalty='l2', C=8)
model.fit(testData, testLabels)
errBase = model.score(valData, valLabels)
cols = testData.columns
batchsize = int(len(bad) * .01)
for i in range(len(bad) // batchsize):
    testData0 = testData.drop(cols[(i*batchsize):((i+1)*batchsize)], axis=1)
    valData0 = valData.drop(cols[(i*batchsize):((i+1)*batchsize)], axis=1)
    model.fit(testData0, testLabels)
    err = model.score(valData0, valLabels)
    if errBase <= err:
        for count in range(batchsize):
            ind = (i * batchsize) + count
            testData0 = testData.drop(cols[ind], axis=1)
            valData0 = valData.drop(cols[ind], axis=1)
            model.fit(testData0, testLabels)
            err = model.score(valData0, valLabels)
            if errBase <= err:
                testData = testData0
                valData = valData0
    print(i)
for i in range(len(bad) - (len(bad) // batchsize) * 100):
    testData0 = testData.drop(cols[(i*batchsize):((i+1)*batchsize)], axis=1)
    valData0 = valData.drop(cols[(i*batchsize):((i+1)*batchsize)], axis=1)
    model.fit(testData0, testLabels)
    err = model.score(valData0, valLabels)
    if errBase <= err:
        testData = testData0
        valData = valData0
f.write(testdata.columns)
f.close()
print(len(testData))    
  

45671
24604


FileNotFoundError: [Errno 2] No such file or directory: '/home/porrster/Data-Science-Decal-Fall-2017/Assignments/Project2_DSDFa17/badcols'

How did you select which features to remove? Why did that reduce overfitting?

In [55]:
cols = testData.columns
#keepCols = [cols[x] for x in keep]
#testDataClean = testData[keepCols[:]]
#valDataClean = valData[keepCols[:]]

print(testData.shape, testLabels.shape)
model = LogisticRegression(penalty='l2', C=1)
model.fit(testData, testLabels)
print(model.score(valData, valLabels))

(1280, 45671) (1280,)
0.84375


# Single Decision Tree

#### Basic Decision Tree

* Initialize your model as a decision tree with sklearn.
* Fit the data and labels to the model.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


In [87]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=100, max_leaf_nodes=500)

classifier.fit(testData, testLabels)
classifier.score(valData, valLabels)

0.64687499999999998

#### Changing Parameters
* To test out which value is optimal for a particular parameter, you can either loop through various values or look into sklearn.model_selection.GridSearchCV

References:


http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [129]:
# Select hyperparameters to test
max_depths = [10, 20, 30, 50, 75, 100]
leaf_nodes = [20, 40, 60, 100, 150, 200, 300, 500]
scores = []
# For each value of the hyperparameters, train and score a model.
for i in range(len(max_depths)):
    scores.append([])
    for j in  range(len(leaf_nodes)):
        temp_classifier = DecisionTreeClassifier(criterion='entropy', 
                                            max_depth=max_depths[i], max_leaf_nodes=leaf_nodes[j])
        temp_classifier.fit(testData, testLabels)
        
        scores[i].append(temp_classifier.score(valData, valLabels))
        


In [132]:
for i in range(len(scores)):
    for j in range(len(scores[i])):
        print(round(scores[i][j], 4), end='  ')
    print("\n")

# Best score is 0.6625, multiple sets of hyperparameters hit this maximum in the dataset.
# Notably when the number of leaf nodes is small, likely prevents overfitting.
# Perhaps there is something better, but based on these results it will not be by much.
# And is going to be very hard to find.

0.6625  0.6469  0.6406  0.6469  0.6438  0.6406  0.65  0.6562  

0.6625  0.6469  0.6438  0.6  0.6312  0.6312  0.6406  0.6188  

0.6625  0.6438  0.6438  0.6125  0.6312  0.6438  0.6188  0.6156  

0.6625  0.6469  0.6438  0.6219  0.6406  0.6312  0.6531  0.6375  

0.6625  0.6469  0.6469  0.6406  0.6188  0.6188  0.6594  0.6406  

0.6625  0.6469  0.6469  0.6188  0.6562  0.6219  0.6062  0.6312  



How did you choose which parameters to change and what value to give to them? Feel free to show a plot.

We chose to explore the number of leaf nodes and the maximum depth of the tree. The two are intrinsically related but offer some interesting interplay. The number of leaf nodes was particularly interesting, as increasing the number of leaf nodes led certainly to lower validation accuracy. We hypothesize this is due to overfitting, since the tree will split too many times outside of what is strictly necessary and instead overfit to the training data. 

Why is a single decision tree so prone to overfitting?

A single decision tree overfits on its training data because it grows organically from its inputs and forms a rigid structure around it. Particularly in this dataset, the decision tree gets a overwhelming amount of features and has a lot of trouble sorting out the useful ones from the noise. One possible thing to explore is limiting max_features, making sure that not too many features are considered at once.

In [97]:
# Using best hyperparameters from last part

for i in range(1, 101, 10):
    limited_classifier = DecisionTreeClassifier(criterion='entropy', 
                                                max_depth=10, max_leaf_nodes=100, max_features=i)

    limited_classifier.fit(testData, testLabels)

    print(limited_classifier.score(valData, valLabels))


0.503125
0.509375
0.565625
0.54375
0.59375
0.634375
0.55625
0.559375
0.496875
0.61875


Changing max_features, contrary to what we expected, did not improve validation accuracy. We could venture the hypothesis that the tree is not in fact overfitting, but rather failing to adequately capture the data in a meaningful way.

# Random Forest Classifier

#### Basic Random Forest

* Use sklearn's ensemble.RandomForestClassifier() to create your model.
* Fit the data and labels with your model.
* Score your model with the same data and labels.

References:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [103]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, criterion='entropy', max_features='auto',
                                    max_depth=100, max_leaf_nodes=500, bootstrap=True)

forest.fit(testData, testLabels)
forest.score(valData, valLabels)

# Clearly, our forest isn't that much better than our decision tree in this state

0.76875000000000004

#### Changing Parameters

In [107]:
# Let's first check how changing the number of trees affects the score

for i in range(10):
    temp_forest = RandomForestClassifier(n_estimators=10*i + 1, criterion='entropy', max_features='auto',
                                    max_depth=100, max_leaf_nodes=500, bootstrap=True)
    forest.fit(testData, testLabels)
    print(forest.score(valData, valLabels))
    
# Having a high number of estimators doesn't seem particularly useful, let's go with 20 as a baseline.

0.80625
0.7875
0.78125
0.759375
0.721875
0.759375
0.759375
0.753125
0.775
0.753125


In [108]:
# Select hyperparameters to test
max_depths = [10, 20, 30, 50, 75, 100]
leaf_nodes = [20, 40, 60, 100, 150, 200, 300, 500]
scores = []
# For each value of the hyperparameters, train and score a model.
for i in range(len(max_depths)):
    scores.append([])
    for j in  range(len(leaf_nodes)):
        temp_forest = RandomForestClassifier(n_estimators=20, criterion='entropy', max_features='auto',
                                    max_depth=max_depths[i], max_leaf_nodes=leaf_nodes[j], bootstrap=True)
        temp_forest.fit(testData, testLabels)
        scores[i].append(temp_forest.score(valData, valLabels))

for i in range(len(scores)):
    print(scores[i])

[0.74687499999999996, 0.75312500000000004, 0.74375000000000002, 0.73124999999999996, 0.77187499999999998, 0.75937500000000002, 0.72187500000000004, 0.73124999999999996]
[0.74375000000000002, 0.734375, 0.76249999999999996, 0.75937500000000002, 0.76875000000000004, 0.74687499999999996, 0.734375, 0.73124999999999996]
[0.73124999999999996, 0.72812500000000002, 0.73750000000000004, 0.74687499999999996, 0.73124999999999996, 0.74687499999999996, 0.73124999999999996, 0.76249999999999996]
[0.75312500000000004, 0.78437500000000004, 0.75937500000000002, 0.75312500000000004, 0.74062499999999998, 0.71875, 0.73124999999999996, 0.74062499999999998]
[0.796875, 0.75624999999999998, 0.74375000000000002, 0.75624999999999998, 0.78437500000000004, 0.72812500000000002, 0.74062499999999998, 0.75312500000000004]
[0.76875000000000004, 0.765625, 0.77812499999999996, 0.75624999999999998, 0.74375000000000002, 0.71250000000000002, 0.74687499999999996, 0.75]


In [118]:
# We encounter some randomness in our random forest growth, so let's make a little averaging function:

def testRandomForest(n_estimators=10, criterion='entropy', max_features='auto',
                                    max_depth=100, max_leaf_nodes=500, bootstrap=True, trials=10):
    scores = []
    for i in range(trials):
        temp_forest = RandomForestClassifier(n_estimators=n_estimators, criterion=criterion, max_features=max_features,
                                        max_depth=max_depth, max_leaf_nodes=max_leaf_nodes, bootstrap=bootstrap)
        temp_forest.fit(testData, testLabels)
        scores.append(temp_forest.score(valData, valLabels))
    
    return sum(scores)/len(scores)

In [111]:
# Abridged hyperparameter list to save time.
max_depths = [10, 20, 30, 50]
leaf_nodes = [50, 100, 200, 500]
scores = []
# For each value of the hyperparameters, train and score a model.
for i in range(len(max_depths)):
    scores.append([])
    for j in  range(len(leaf_nodes)):
        scores[i].append(testRandomForest(n_estimators=20, criterion='entropy', max_features='auto',
                                    max_depth=max_depths[i], max_leaf_nodes=leaf_nodes[j], bootstrap=True))
        

for i in range(len(scores)):
    print(scores[i])
    
# We see in general that having higher maximum depth and higher number of leaves leads to higher accuracy. 
# However, averaging our values is computationally expensive and should be avoided.

[0.71875, 0.73062500000000008, 0.73687500000000006, 0.72218750000000009]
[0.74843750000000009, 0.74875000000000003, 0.73999999999999999, 0.739375]
[0.74937500000000001, 0.75812500000000005, 0.74124999999999996, 0.73624999999999996]
[0.75656249999999992, 0.75968750000000007, 0.7421875, 0.73375000000000001]


In [121]:
for t in range(0, 30, 5):
    print(testRandomForest(n_estimators=20, criterion='entropy', max_features='auto',
                                    max_depth=30+t, max_leaf_nodes=100, bootstrap=True))
    
# Note that the general trend is rising, even if the randomness in the forest selection 
#  smudges the effects.

0.7409375
0.736875
0.734375
0.7559375
0.749375
0.7490625


In [122]:
print("With bootstrap, the validation accuracy is:") 
print(testRandomForest(n_estimators=20, criterion='entropy', max_features='auto',
                                    max_depth=30, max_leaf_nodes=100, bootstrap=True, trials=20))
print("Without bootstrap, the validation accuracy is:")
print(testRandomForest(n_estimators=20, criterion='entropy', max_features='auto',
                                    max_depth=30, max_leaf_nodes=100, bootstrap=False, trials=20))

# Strangely, removing bootstrap decreases validation accuracy.

With bootstrap, the validation accuracy is:
0.75625
Without bootstrap, the validation accuracy is:
0.77453125


What parameters did you choose to change and why?

We first chose to play around with n_estimators. It seems like very large numbers of trees in the forest doesn't lead to greatly improved results. Messing around with max_depth and max_leaf_nodes shows that unlike in the regular forests, increasing the depth of the tree does not decrease the validation accuracy. We would guess that this is due to the random nature of the forest: there is no room for overfitting because any overfitted regions of any one tree will be averaged out over the whole forest. One thing we found that was very interesting was that removing bootstrap increases the validation accuracy on average. We have a couple hypotheses for this, but one proposed cause is that our reviews don't tend to center around a mean, since language is such a diverse and variant thing. The most common words are also the words most devoid of positivity and negativity, and the spread of positive and negative words used keep there from being a well defined central tendency that bootstrapping would enhance. Instead, emphasizing the random selection of features forces us to look at different words and make decisions on them, which gives us more perspectives on the words in the dataset.

How does a random forest classifier prevent overfitting better than a single decision tree?

Because a random forest doesn't limit itself to a single section of the dataset, it is less prone to making decision errors on certain splits. The splits are chosen randomly so not every tree encounters the same splits, and as a result, certain splits that misrepresent the actual dataset by overfitting to the training data are averaged out of the system ahead of time. Training in ensemble, then, gives us better results since we can ignore the particular errors in splitting of a single decision tree (equivalent to asking too specific a question in 20 questions), and instead let the groupthink of the random forest make the right decisions. 