# News and Gun Violence: Bias in the Writing?

### Executive Summary

This project set out to answer the following question: is there racial bias in the words newspapers, in this case the New York Times, use when covering mass shootings, depending on the ethnicity of the shooter? If so, can the content of an article be used to predict the shooter's race?

#### The Data
The data was gathered from two main sources: the Stanford University Mass Shooting Database and the New York Times. I chose the Stanford database because its maintainers are among the leading collectors of data on mass shootings in the US. In fact, they created the current definition of a mass shooting: an incident in which four or more people are wounded or killed. Their database covers every incident between 1966 and March 2016. However, due to the format of the New York Times files I was unable to get articles from the 1960s, so I only used articles from 1970 to March 2016. The New York Times was selected because it is one of the few papers covering national news that has readily available archived articles. From the New York Times I collected every article, along with its title, related to each incident from the day of the incident through the six days following. This time period was selected because it is the typical length of one news cycle.

The final features used for the model were the Incident Name and Shooter Race features from the Stanford database, together with each article's content and title. After reviewing the ethnicities in the database I discovered that Latin Americans were largely classified as "Other", so I researched each individual to confirm their ethnicity and changed it in the data frame. The races included were (in order of number of shooters) White, Asian, African American, Two or more races, Latin American, Native American or Alaska Native, Other, and Unknown.

#### Preparing the Data
To prepare the data for natural language processing, several steps were needed to ensure smooth processing and to strip out as many unnecessary words as possible. A function removed non-letters, made every letter lower case, and removed the standard English stop words as well as my custom list of stop words, which included the location of the shooting, the shooter's name, the month, and the name of the incident. In addition, tokenizing, stemming, and lemmatizing cut down on repeated words that are basically the same, such as run, runs, running, and ran.

The other major challenge with the data was managing the unbalanced classes before doing a train-test split or countvectorizing. The class counts were massively imbalanced because, as expected, the ethnicities were not evenly spread. The spread, in percent, is as follows:
    White = 68.66%
    Asian = 14.43%
    African American = 11.11%
    Latin American = 2.09%
    Native American or Alaska Native = 1.33%
These numbers are actually quite interesting when you compare them to the demographic makeup of the entire US population, which is as follows:
    White = 72.4%
    Asian = 4.8%
    African American = 12.6%
    Latin American = 16.3%
    Native American or Alaska Native = 0.9%
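
To make the comparison concrete, each group's share of the articles can be divided by its share of the population. This is only a sketch: the article shares are the ones listed in the notebook's Base Accuracy cell, the population shares are the ones quoted above, and the name `representation_ratio` is introduced here purely for illustration.

```python
# Share of NYT articles per shooter race (percent), as listed in the
# notebook's Base Accuracy cell.
article_share = {
    "White": 68.66,
    "Asian": 14.43,
    "African American": 11.11,
    "Latin American": 2.09,
    "Native American or Alaska Native": 1.33,
}

# Share of the US population (percent), as quoted in the text.
population_share = {
    "White": 72.4,
    "Asian": 4.8,
    "African American": 12.6,
    "Latin American": 16.3,
    "Native American or Alaska Native": 0.9,
}

# A ratio above 1 means the group appears more often in the articles than
# its population share alone would suggest; below 1 means less often.
representation_ratio = {
    race: round(article_share[race] / population_share[race], 2)
    for race in article_share
}

for race, ratio in sorted(representation_ratio.items(), key=lambda kv: -kv[1]):
    print("{:<35} {:.2f}".format(race, ratio))
```

Under these figures the Asian and Latin American ratios sit furthest from 1, in opposite directions.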
To adjust for the unbalanced classes I resampled the data so that all ethnicities had the same number of observations as the majority class, White. Because the minority classes are sampled with replacement, this produces many duplicate rows: the added observations are random repeats of the original articles rather than new data.
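
A small standard-library sketch of what upsampling with replacement does to a data set. The toy corpus, the class sizes, and the helper names are all hypothetical; the notebook itself uses `sklearn.utils.resample` on the real data frame.

```python
import random
from collections import Counter

# Hypothetical toy corpus: (article_id, shooter_race) pairs with a 6:2:1 imbalance.
articles = [(i, "White") for i in range(6)]
articles += [(i, "Asian") for i in range(6, 8)]
articles += [(8, "Latin American")]

rng = random.Random(66)  # fixed seed so the sketch is reproducible

# Group the rows by class.
by_class = {}
for article in articles:
    by_class.setdefault(article[1], []).append(article)

target = max(len(rows) for rows in by_class.values())  # size of the majority class

balanced = []
for race, rows in by_class.items():
    if len(rows) == target:
        balanced.extend(rows)  # the majority class is kept as-is
    else:
        # Sampling with replacement can only repeat rows that already exist.
        balanced.extend(rng.choices(rows, k=target))

class_counts = Counter(race for _, race in balanced)
unique_rows = len(set(balanced))
print(class_counts)
print("rows:", len(balanced), "unique rows:", unique_rows)
```

Every class ends up with the same count, but the number of distinct articles never exceeds what the original corpus contained.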

### Model Selection
The current model is a Random Forest Classifier. I am attempting to see whether certain words lead to a shooter being classified as a certain ethnicity, and how accurate the Random Forest's classification is.

### Statistical Analysis, Recommendations, and Next Steps
The Random Forest generated an accuracy score of 0.8557. The model is extremely precise at predicting Unknown (100%), Native American or Alaska Native (99%), Other (98%), and Latin American (92%). The next cluster of precision scores is African American (82%) and Two or more races (82%). Interestingly, the two populations with the greatest number of perpetrators have the lowest precision scores: Asian (61%) and White (72%).
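
For readers less familiar with the metric, the sketch below shows how per-class precision and overall accuracy fall out of a confusion matrix. The three-class matrix is made up for illustration and is not the model's actual output.

```python
# Hypothetical 3-class confusion matrix: rows are true classes, columns are
# predicted classes (the layout sklearn's confusion_matrix uses).
labels = ["White", "Asian", "Unknown"]
confusion = [
    [70, 25, 5],    # true White
    [20, 80, 0],    # true Asian
    [0, 0, 100],    # true Unknown
]


def precision_per_class(matrix, names):
    """Precision = correct predictions of a class / all predictions of that class."""
    result = {}
    for col, name in enumerate(names):
        predicted_total = sum(row[col] for row in matrix)
        result[name] = matrix[col][col] / predicted_total if predicted_total else 0.0
    return result


precision = precision_per_class(confusion, labels)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in confusion)

for name in labels:
    print("{:<8} precision = {:.2f}".format(name, precision[name]))
print("accuracy = {:.2f}".format(accuracy))
```

A class can have high precision simply because the model rarely predicts it, which is worth keeping in mind when comparing the small classes with the large ones.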

While the model appears to be quite accurate, once the feature importances are printed out it is clear that no particular words appearing in a New York Times article specifically indicate the race of the shooter. While this does allow me to reject my hypothesis (good news for the New York Times), moving forward I would like to look at more national newspapers and see how they differ by location, region, and political leaning. I would also like to see how local papers covered these events, to see whether they focus primarily on the crime, the shooter, and the event itself or on helping the families affected by these horrible events.

##### Loading Packages

In [1]:
# Standard library
import itertools
from itertools import chain
import random
import re
import string

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Natural language processing
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Modeling. sklearn.grid_search, sklearn.cross_validation, and sklearn.learning_curve
# were folded into sklearn.model_selection, so everything is imported from there.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.model_selection import (GridSearchCV, ShuffleSplit, StratifiedKFold,
                                     cross_val_score, learning_curve, train_test_split)
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils import resample

%matplotlib inline
tqdm.pandas(desc='progress-bar')



## Exploratory Data Analysis and Database Cleaning

In [6]:
NYT1970s=pd.read_csv('./Assets/NYT 1970s Shootings-2.csv')
NYT1980s=pd.read_csv('./Assets/NYT 1980s Shootings-3.csv')
NYT1990s=pd.read_csv('./Assets/NYT 1990s Shootings-3.csv')
NYT2000s=pd.read_csv('./Assets/NYT 2000s Shootings-2.csv')
NYT2010s=pd.read_csv('./Assets/NYT 2010s-4 Shootings.csv')
shootingdb= pd.read_excel('./Assets/MSA/Stanford_MSA_Database_for_release_06142016.xlsx')

In [7]:
NYT1970s=NYT1970s[['Incident Name','Title','Article','Shooter Race']]
NYT1980s=NYT1980s[['Incident Name','Title','Article','Shooter Race']]
NYT1990s=NYT1990s[['Incident Name','Title','Article','Shooter Race']]
NYT2000s=NYT2000s[['Incident Name','Title','Article','Shooter Race']]
NYT2010s=NYT2010s[['Incident Name','Title','Article','Shooter Race']]

In [8]:
# ignore_index=True renumbers the rows 0..n-1 across the five decade files
NYT = pd.concat([NYT1970s, NYT1980s, NYT1990s, NYT2000s, NYT2010s], ignore_index=True)
NYT.head(10)

| | Incident Name | Title | Article | Shooter Race |
|---|---|---|---|---|
| 0 | NOLA PD | | | |
| 1 | Clara Barton Elementary | | | |
| 2 | Olean High School | Sniper's Classmate Says Guns Were ‘Whole Life’ | The attack at the school has stunned this comm... | White |
| 3 | Olean High School | 3 Killed and 9 Wounded By an Upstate Sniper, 18 | The youth was charged with three counts of mur... | White |
| 4 | LA Computer Learning Center | | | |
| 5 | Cal State Fullerton | | | |
| 6 | Grover Cleveland Elementary School | San Diego Girl Slays 2 With Rifle And Wounds 9... | Special weapons and tactics officers from the ... | White |
| 7 | Grover Cleveland Elementary School | Tomboy and Gun Enthusiast | SAN DIEGO, Jan. 29 (AP) — Brenda Spencer's cla... | White |
| 8 | Grover Cleveland Elementary School | Coast Sniper Vowed She Would ‘Do Something Big’ | SAN DIEGO, Jan. 30 — Wally Spencer's eyes were... | White |
| 9 | Univeristy of South Carolina | The New York Times | Radioactive Tritium Seizure Brings Bankruptcy ... | African American |


In [9]:
#Ritika EDA
def eda(dataframe):
    """Print a quick overview of a dataframe: gaps, duplicates, types, and summary stats."""
    print("Missing Values \n \n{}\n".format(dataframe.isnull().sum()))  # find missing values
    print("Duplicate Rows \n{}\n".format(dataframe.duplicated().sum()))  # find duplicated rows
    print("Dataframe Types \n \n{}\n".format(dataframe.dtypes))  # datatype of each column
    print("Dataframe Shape \n{}\n".format(dataframe.shape))  # number of rows and columns
    print("Dataframe Describe \n \n{}\n".format(dataframe.describe(include='all')))  # describe all columns
    for feature in dataframe:  # print the number of unique values in each column
        print(feature)
        print(dataframe[feature].nunique())

In [10]:
eda(NYT)

Missing Values 
 
Incident Name      0
Title            133
Article          133
Shooter Race     133
dtype: int64 

Duplicate Rows 
11 

Dataframe Types 
 
Incident Name    object
Title            object
Article          object
Shooter Race     object
dtype: object 

Dataframe Shape 
(1194, 4) 

Dataframe Describe 
 
                            Incident Name               Title  \
count                                1194                1061   
unique                                291                1021   
top     Tucscon, Arizona - Gabby Giffords  The New York Times   
freq                                   83                   7   

                                                  Article Shooter Race  
count                                                1061         1061  
unique                                               1032            8  
top     The Lede is a blog that remixes national and i...        White  
freq                                                    9     

In [11]:
NYT.dropna(inplace=True)
NYT.drop_duplicates(inplace=True)

In [12]:
eda(NYT)

Missing Values 
 
Incident Name    0
Title            0
Article          0
Shooter Race     0
dtype: int64 

Duplicate Rows 
0 

Dataframe Types 
 
Incident Name    object
Title            object
Article          object
Shooter Race     object
dtype: object 

Dataframe Shape 
(1053, 4) 

Dataframe Describe 
 
                            Incident Name               Title  \
count                                1053                1053   
unique                                161                1021   
top     Tucscon, Arizona - Gabby Giffords  The New York Times   
freq                                   83                   7   

                                                  Article Shooter Race  
count                                                1053         1053  
unique                                               1032            8  
top     The Lede is a blog that remixes national and i...        White  
freq                                                    9          723 

In [13]:
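# regex=True makes this a substring replacement; without it, replace() only
# changes cells whose entire value is exactly "N.R.A"
NYT.replace(r"N\.R\.A", "NRA", regex=True, inplace=True)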
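
One detail of `DataFrame.replace` is easy to miss: with a plain string it matches whole cell values, and it only searches inside strings when `regex=True` is passed. The two-row frame below is hypothetical and exists only to show the difference.

```python
import pandas as pd

# Hypothetical two-row frame: one cell is exactly "N.R.A", the other only contains it.
demo = pd.DataFrame({"Article": ["N.R.A", "The N.R.A responded on Monday"]})

# Without regex=True, replace() only matches cells whose entire value equals the key.
whole_cell = demo.replace("N.R.A", "NRA")

# With regex=True the pattern is searched inside each string; the dots are
# escaped so they match literal periods rather than any character.
substring = demo.replace(r"N\.R\.A", "NRA", regex=True)

print(whole_cell["Article"].tolist())
print(substring["Article"].tolist())
```

Only the regex version rewrites the abbreviation in the middle of a sentence, which is the case that matters for article text.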

### Generating Stop Words

In [None]:
names=shootingdb['Shooter Name'].tolist()
incidents=NYT['Incident Name'].tolist()
location=shootingdb['Location'].tolist()
city=shootingdb['City'].tolist()
state=shootingdb['State'].tolist()
title=shootingdb['Title'].tolist()
month =['jan','feb', 'mar', 'apr','june','july','aug', 'sep','oct','nov','dec']
dbstops= [names + incidents + location + city + state + title + month]

In [None]:
# Flatten the nested list, split multi-word entries into single words, and lower-case them
dbstop = list(itertools.chain(*dbstops))
dbstop = [word.lower() for phrase in dbstop for word in phrase.split(" ")]
# Combine NLTK's English stop words with the custom list
stop = stopwords.words('english')
stop += dbstop
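
The idea behind the custom stop list can be sketched with a few stand-in values. Everything below is illustrative: the short lists replace the database columns, and `english_stop` is a three-word placeholder for `stopwords.words('english')`.

```python
import itertools

# Hypothetical stand-ins for the database columns used above.
names = ["Brenda Spencer"]
incidents = ["Grover Cleveland Elementary School"]
city = ["San Diego"]
month = ["jan", "feb"]

english_stop = ["the", "a", "of"]  # placeholder for stopwords.words('english')

# Chain the lists, split multi-word entries on whitespace, and lower-case each token.
phrases = itertools.chain(names, incidents, city, month)
custom_stop = [token.lower() for phrase in phrases for token in phrase.split()]

# A set removes repeats and gives fast membership tests during cleaning.
stop_words = set(english_stop) | set(custom_stop)
print(sorted(stop_words))
```

Removing names and places this way keeps the classifier from simply memorising which incident an article belongs to.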

### Cleaning Dataframe

In [None]:
def remove(text):
    """Return text lower-cased, with non-letters and stop words removed."""
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Convert to lower case and split into individual words
    words = letters_only.lower().split()
    # Remove stop words
    meaningful_words = [w for w in words if w not in stop]
    # Join the words back into one string separated by spaces
    return " ".join(meaningful_words)
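
A quick way to see what the cleaning step does is to run the same three operations on one headline. The function name `clean_text` and the two-word stop set are assumptions made for this sketch; the notebook uses the much larger `stop` list built earlier.

```python
import re

# Hypothetical stop set; the notebook builds the real one from NLTK plus the database.
example_stop = {"s", "were"}


def clean_text(text, stop_words):
    """Keep letters only, lower-case, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stop_words)


cleaned = clean_text("Sniper's Classmate Says Guns Were 'Whole Life'", example_stop)
print(cleaned)  # sniper classmate says guns whole life
```

Note how the apostrophe turns "Sniper's" into two tokens, which is why a stray "s" needs to be on the stop list.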

In [None]:
NYT['Title']=NYT['Title'].apply(remove)

In [None]:
NYT['Article']=NYT['Article'].apply(remove)

In [None]:
NYT.reset_index(drop=True, inplace=True)

In [None]:
# Evann G Smith's function (https://github.com/evanngsmith/GA)
lemma = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop tokens of one or two characters, then stem and lemmatize."""
    try:
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if len(t) > 2]
        tokens = [stemmer.stem(t) for t in tokens]
        tokens = [lemma.lemmatize(t) for t in tokens]
        # Return None when nothing is left so empty rows are easy to spot
        return ' '.join(tokens) if tokens else None
    except Exception:
        # Input that cannot be tokenized (for example NaN) is mapped to None
        return None

In [None]:
NYT['Title']=NYT['Title'].progress_map(preprocess)

In [None]:
NYT['Article']=NYT['Article'].progress_map(preprocess)

In [None]:
NYT.head()

##### Base Accuracy & Resampling

In [None]:
NYT['Shooter Race'].value_counts()

In [None]:
# Base accuracy per class, in percent (share of articles in each class)
# White = 68.66
# Asian = 14.43
# African American = 11.11
# Latin American = 2.09
# Native American or Alaska Native = 1.33

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority1 = NYT[NYT['Shooter Race'] == "Asian"]

# Upsample minority class
minority_upsampled1 = resample(minority1,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority2 = NYT[NYT['Shooter Race'] == "African American"]

# Upsample minority class
minority_upsampled2 = resample(minority2,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority3 = NYT[NYT['Shooter Race'] == "Two or more races"]

# Upsample minority class
minority_upsampled3 = resample(minority3,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority4 = NYT[NYT['Shooter Race'] == "Latin American"]

# Upsample minority class
minority_upsampled4 = resample(minority4,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority5 = NYT[NYT['Shooter Race'] == "Native American or Alaska Native"]

# Upsample minority class
minority_upsampled5 = resample(minority5,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority6 = NYT[NYT['Shooter Race'] == "Other"]

# Upsample minority class
minority_upsampled6 = resample(minority6,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
majority = NYT[NYT['Shooter Race'] == "White"]
minority7 = NYT[NYT['Shooter Race'] == "Unknown"]

# Upsample minority class
minority_upsampled7 = resample(minority7,
                                 replace=True,     # sample with replacement
                                 n_samples=723,    # to match majority class
                                 random_state=66) # reproducible results

In [None]:
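# ignore_index=True gives the combined frame a fresh 0..n-1 index, so the later
# index-based merge with the count features lines up row by row
NYT = pd.concat([majority, minority_upsampled1, minority_upsampled2, minority_upsampled3,
                 minority_upsampled4, minority_upsampled5, minority_upsampled6, minority_upsampled7],
                ignore_index=True)
print(NYT.shape)
NYT.head()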
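
The seven near-identical upsampling cells above could also be written as one loop. This is a sketch on a hypothetical six-row frame, not a drop-in replacement: the `toy` frame and the variable names are made up, and only `resample` and the column name come from the notebook.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy frame standing in for NYT.
toy = pd.DataFrame({
    "Article": ["a", "b", "c", "d", "e", "f"],
    "Shooter Race": ["White", "White", "White", "White", "Asian", "Other"],
})

counts = toy["Shooter Race"].value_counts()
majority_label = counts.idxmax()
majority_size = int(counts.max())

# Start with the majority class unchanged, then upsample every other class to match it.
parts = [toy[toy["Shooter Race"] == majority_label]]
for label in counts.index.drop(majority_label):
    minority = toy[toy["Shooter Race"] == label]
    parts.append(resample(minority,
                          replace=True,             # sample with replacement
                          n_samples=majority_size,  # match the majority class
                          random_state=66))         # reproducible results

balanced = pd.concat(parts, ignore_index=True)
print(balanced["Shooter Race"].value_counts())
```

Looping over the labels found in the data also avoids hard-coding the majority class size.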

### CountVectorizer

In [None]:
vectorizer = CountVectorizer(analyzer = "word", 
                             tokenizer = None,    
                             preprocessor = None,
                             stop_words=stop,
                             max_features=1500,
                             min_df=1) 

data_features = vectorizer.fit_transform(NYT['Article'])
data_features = data_features.toarray()
print(vectorizer)

In [None]:
print(data_features.shape)
print(data_features)

In [None]:
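# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2
vocab = list(vectorizer.get_feature_names_out())
print(vocab)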

In [None]:
vocab2=pd.DataFrame(data_features, columns=vocab)
vocab2.head()
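
To see what the count matrix looks like on a scale small enough to read, here is a sketch with three made-up documents. The sentences and the `max_features=4` setting are assumptions chosen so the result fits on screen.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical, already-cleaned documents.
docs = ["police said gunman fired", "gunman fired fired again", "police said"]

vec = CountVectorizer(max_features=4)  # keep only the 4 most frequent terms
counts = vec.fit_transform(docs).toarray()

# vocabulary_ maps each kept term to its column index in the count matrix.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
bag = pd.DataFrame(counts, columns=terms)
print(bag)
```

Each row is one document and each column one word; "again" is dropped because it is the least frequent term.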

In [None]:
largedf=NYT.merge(vocab2, left_index=True  ,right_index=True, how='inner')
largedf = largedf.rename(columns = {'fit': 'fit_feature'})
largedf.head()
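
Because this merge joins on the index, it pairs rows by index label rather than by position. The sketch below, built on two tiny hypothetical frames, shows what happens when the left frame still carries repeated labels from upsampling and how resetting the index first restores a row-by-row pairing.

```python
import pandas as pd

# Hypothetical upsampled frame: index labels repeat because rows were drawn
# with replacement from the original frame.
left = pd.DataFrame({"label": ["White", "Asian", "Asian"]}, index=[0, 1, 1])

# Feature rows built positionally, so they carry a fresh 0..2 index.
right = pd.DataFrame({"count": [10, 20, 30]})

# Joining on the index matches labels: both rows labelled 1 receive count 20.
by_label = left.merge(right, left_index=True, right_index=True, how="inner")

# Resetting the index first pairs row i with feature row i.
by_position = left.reset_index(drop=True).merge(
    right, left_index=True, right_index=True, how="inner")

print(by_label)
print(by_position)
```

In the first result the third feature row is never used, so checking that the index is a clean 0..n-1 range before merging is a cheap safeguard.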

### Model - Random Forest

In [None]:
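# The count features occupy the columns from the first to the last vocabulary word
X = largedf.loc[:, 'abandon':'zappala']
y = largedf['Shooter Race']
# Train/test split. Note that train_size=0.3 trains on 30% of the rows and tests on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)
print("Train: {} {}".format(X_train.shape, y_train.shape))
print("Test: {} {}".format(X_test.shape, y_test.shape))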
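
The `train_size` and `test_size` arguments are easy to mix up, so here is a short sketch of the difference on 100 hypothetical rows. The arrays and the `random_state=0` seed are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 hypothetical rows, 2 features
y_demo = np.array([0, 1] * 50)

# train_size=0.3 keeps 30% of the rows for training; the rest become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# test_size=0.3 is the opposite arrangement: most rows go to training.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr2.shape, X_te2.shape)
```

Passing `random_state` also makes the split repeatable, which helps when comparing accuracy scores between runs.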

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
def evaluate_model(model):
    """Fit the model on the training split, print test-set diagnostics, and return accuracy."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    a = accuracy_score(y_test, y_pred)

    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred)

    print(cm)
    print(cr)
    return a

In [None]:
rf = RandomForestClassifier(min_samples_split=2, n_estimators=100, criterion='gini', 
                            max_depth=10, class_weight=None,random_state=86)
evaluate_model(rf)

In [None]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

params = {'n_estimators':[1, 10, 100, 1000],
          'criterion': ['gini', 'entropy'],
          'max_depth': [1, 3, 5,7,10],
          'min_samples_split': [2,5],
          'class_weight':[None, 'balanced']}

gsrf = GridSearchCV(rf, params, n_jobs=-1, cv=10)
gsrf.fit(X, y)
print(gsrf.best_params_)
print(gsrf.best_score_)

print(gsrf.best_estimator_)
evaluate_model(gsrf.best_estimator_)

In [None]:
# One row per word, sorted from most to least important
importance = pd.DataFrame(rf.feature_importances_,
                          index=X.columns,
                          columns=['Word Importance']).sort_values('Word Importance',
                                                                   ascending=False)
importance.head(10)
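
As a rough illustration of where these numbers come from, the sketch below fits a small forest on synthetic data and compares the forest-level importances with the per-tree values. The data set, the forest size, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 200 rows, 6 features, only the first 2 carry signal
# (shuffle=False keeps the informative features in the first columns).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=86)
forest.fit(X_toy, y_toy)

# The forest's importances are the normalised average of the per-tree importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)  # tree-to-tree variation, usable as error bars

ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking:
    print("feature {}: {:.3f} +/- {:.3f}".format(
        idx, forest.feature_importances_[idx], spread[idx]))
```

Importances indicate which features the trees split on most, not which class a feature points towards, so a high-ranking word is not by itself evidence of bias in either direction.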

In [None]:
import matplotlib.pyplot as plt

importances = rf.feature_importances_
# Standard deviation of each feature's importance across the trees in the random forest
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

indices = np.argsort(importances)[::-1]
feature_names = X.columns
top_n = 20  # plot only the most important words so the labels stay readable

# Plot the feature importances of the forest
plt.figure(figsize=(10,10))
plt.title("Feature importances")
plt.bar(range(top_n), importances[indices][:top_n],
       color="r", yerr=std[indices][:top_n], align="center")
plt.xticks(range(top_n), feature_names[indices][:top_n], rotation=90)
plt.xlim(-1, top_n)
plt.show()