<a href="https://colab.research.google.com/github/SimoneKris/KGS-Data-Analytics-Portfolio/blob/main/MilestoneFinalProjectModelingNotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [41]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

In [42]:
#upload data set
from google.colab import files
mass_shootings = files.upload()

Saving Mass Shootings 1982-2022.csv to Mass Shootings 1982-2022 (2).csv


In [43]:
# name dataframe df
df = pd.read_csv('Mass Shootings 1982-2022.csv')

In [44]:
# check the data for values
df.head()

Unnamed: 0,place_of_shooting,city,state,date,summary,fatalities,injured,total_victims,location,age_of_shooter,...,year,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28
0,Greenwood Park Mall shooting,Greenwood,IN,7/17/22,"Jonathan Sapirman, 20, opened fire in a mall f...",3,2,5,Workplace,20.0,...,2022,,,,,,,,,
1,Highland Park July 4 parade shooting,Highland Park,IL,7/4/22,"Suspected gunman Robert ""Bobby"" Crimo, 21, all...",7,46,53,Other,21.0,...,2022,,,,,,,,,
2,Church potluck dinner shooting,Birmingham,AL,6/16/22,"Robert Findlay Smith, 70, opened fire with a h...",3,0,3,Religious,70.0,...,2022,,,,,,,,,
3,Concrete company shooting,Smithsburg,MD,6/9/22,The suspected 23-year-old gunman shot four cow...,3,1,4,Workplace,23.0,...,2022,,,,,,,,,
4,Tulsa medical center shooting,Tulsa,OK,6/1/22,"Michael Louis, 45, killed four, including two ...",4,10,14,Workp;ace,45.0,...,2022,,,,,,,,,


In [45]:
#drop unnamed columns
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

In [46]:
#check data types
print(df.dtypes)

place_of_shooting                    object
city                                 object
state                                object
date                                 object
summary                              object
fatalities                            int64
injured                               int64
total_victims                         int64
location                             object
age_of_shooter                      float64
prior_signs_mental_health_issues     object
mental_health_details                object
weapons_obtained_legally             object
where_obtained                       object
weapon_type                          object
weapon_details                       object
race                                 object
gender                               object
type                                 object
year                                  int64
dtype: object


In [47]:
print(df.shape)

(133, 20)


In [48]:
# renaming columns: prior_signs_mental_health_issues and weapons_obtained_legally
df.rename(columns = {'prior_signs_mental_health_issues':'mental_issues', 'weapons_obtained_legally':'weapon_legal'}, inplace = True)

**SPLITTING THE DATA:**
The target for this model is prior signs of mental health issues. I am trying to correctly classify whether a shooter showed prior signs of mental health issues, based on the other factors in the data set. I will delete "type", "year", "weapon details", "summary", "place of shooting", and "mental health details" because they will not provide useful data for the model, and place of shooting also overlaps with location.

In [49]:
#drop columns
df = df.drop(['type', 'year', 'weapon_details', 'summary', 'place_of_shooting', 'mental_health_details'], axis = 1)

In [51]:
# define X and y
X = df[['city', 'state', 'date', 'age_of_shooter', 'weapon_legal', 'weapon_type', 'location', 'race', 'total_victims', 'gender']]
y = df['mental_issues']

In [52]:
#imputing race with most common race detected
df['race']=df['race'].fillna(df['race'].mode()[0])

To use gender and race as variables in the model, those columns need to be encoded. Neither is an ordinal variable, so one-hot encoding is used. I also encoded prior signs of mental health issues as 1 (yes) and 0 (no).

In [53]:
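# encoding on mental issues
# note: the column holds capitalized labels such as 'Yes' and 'No',
# so these lowercase comparisons leave most rows unmatched (NaN)

df.loc[df['mental_issues'] == 'yes', 'issues'] = 1
df.loc[df['mental_issues'] == 'no', 'issues'] = 0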
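
A case-insensitive mapping is one way to encode the target without depending on how each label is capitalized. The sketch below uses a small made-up frame (`toy`) rather than the real data, and the mixed-case and padded labels in it are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df['mental_issues']; these labels are assumed, not taken from the CSV
toy = pd.DataFrame({'mental_issues': ['Yes', 'no', 'YES ', None, 'No', 'Unclear']})

# Normalize whitespace and case, then map the two known labels to 1/0.
# Missing or ambiguous values are left as NaN instead of being guessed.
cleaned = toy['mental_issues'].str.strip().str.lower()
toy['issues'] = cleaned.map({'yes': 1, 'no': 0})

print(toy)
print('rows without a usable label:', toy['issues'].isna().sum())
```

Rows that stay NaN can then be dropped or handled explicitly before modeling.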

In [54]:
#one hot encoding for gender and race
one_hot = pd.get_dummies(data=X, columns=['gender', 'race'])

print(one_hot.head())

print(one_hot.columns)

X = pd.DataFrame(one_hot)

            city state     date  age_of_shooter weapon_legal  \
0      Greenwood    IN  7/17/22            20.0          Yes   
1  Highland Park    IL   7/4/22            21.0          Yes   
2     Birmingham    AL  6/16/22            70.0          Yes   
3     Smithsburg    MD   6/9/22            23.0          NaN   
4          Tulsa    OK   6/1/22            45.0          Yes   

                                  weapon_type   location  total_victims  \
0                        semiautomatic rifles  Workplace              5   
1                         semiautomatic rifle      Other             53   
2                       semiautomatic handgun  Religious              3   
3                       semiautomatic handgun  Workplace              4   
4  semiautomatic rifle; semiautomatic handgun  Workp;ace             14   

   gender_F  gender_Female  ...  gender_Male & Female  race_Asian  race_Black  \
0         0              0  ...                     0           0           0   
1 
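
The output above shows separate dummy columns for what look like the same category (for example `gender_F` and `gender_Female`). A possible fix is to normalize the labels before calling `pd.get_dummies`. This is only a sketch on made-up rows: the `gender_map` dictionary and the spelling variants of "White" are assumptions, not values confirmed from the CSV.

```python
import pandas as pd

# Toy rows imitating inconsistent labels (assumed variants)
toy = pd.DataFrame({
    'gender': ['M', 'Male', 'F', 'Female', 'Male & Female'],
    'race': ['White', 'white', 'White ', 'Black', 'Asian'],
})

# Hypothetical mapping from abbreviations to full labels
gender_map = {'m': 'male', 'f': 'female'}

# Strip stray whitespace and lowercase so variants collapse into one label
for col in ['gender', 'race']:
    toy[col] = toy[col].str.strip().str.lower()
toy['gender'] = toy['gender'].replace(gender_map)

encoded = pd.get_dummies(toy, columns=['gender', 'race'])
print(encoded.columns.tolist())
```

After normalization each category produces a single dummy column, so the model does not treat "M" and "Male" as different groups.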

In [55]:
#splitting the data into a training/validation set and a test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Next: Split the training/validation dataset into a training set and a validation set. Use train_test_split from sklearn.model_selection to split X_train_val and y_train_val into X_train, X_val, y_train, and y_val. Set test_size = 0.3333 (this will be the size of the validation set) and random_state = 42.

In [56]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3333, random_state=42) 
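
To see what the two splits do to the row counts, the sketch below repeats them on placeholder arrays with 133 rows, the same number as the data set. The arrays `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 133 rows, matching df.shape[0]
X_demo = np.arange(133).reshape(-1, 1)
y_demo = np.arange(133) % 2

# First split: hold out 25% as the test set
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
# Second split: one third of the remainder becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.3333, random_state=42)

for name, part in [('train', X_tr), ('validation', X_va), ('test', X_te)]:
    print(f'{name}: {len(part)} rows ({len(part) / len(X_demo):.0%})')
```

Taking one third of the remaining 75% leaves roughly half of the rows for training and about a quarter each for validation and testing.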

**Clean and Preprocess the Data** 
Some things being considered are missing data that needs to be imputed and features that need encoding.

In [57]:
#null values
df.isna().sum()

city                0
state               0
date                0
fatalities          0
injured             0
total_victims       0
location            0
age_of_shooter      3
mental_issues      27
weapon_legal       14
where_obtained      0
weapon_type         3
race                0
gender              0
issues            127
dtype: int64

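Of the 14 original columns remaining in the data set, only 4 contain null values: age_of_shooter, mental_issues, weapon_legal, and weapon_type (the derived issues column is also mostly null). Even though we do not know whether there were prior signs of mental health issues for 27 of the shooters (approximately 20% of the total), we can impute this using the most common value among the remaining 106 shooters.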
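
As a rough illustration of the imputation idea, the sketch below fills a numeric column with its mean and a categorical column with its most frequent label. The `toy` frame is made up; only the column names come from the data set.

```python
import pandas as pd

# Made-up rows; only the column names match the notebook's data
toy = pd.DataFrame({
    'age_of_shooter': [20.0, None, 40.0, 30.0],
    'weapon_legal': ['Yes', None, 'Yes', 'No'],
})

# Numeric column: fill gaps with the mean of the observed values
toy['age_of_shooter'] = toy['age_of_shooter'].fillna(toy['age_of_shooter'].mean())

# Categorical column: a mean is not defined, so use the most frequent label
toy['weapon_legal'] = toy['weapon_legal'].fillna(toy['weapon_legal'].mode()[0])

print(toy)
```

When the column being filled is the target itself, dropping the unlabeled rows is a common alternative, since imputed labels would be treated as ground truth during training.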

In [58]:
#examine relationship between mental_issues and weapon_legal to determine if weapon_legal should be imputed or dropped

df_mental_issues = df[['mental_issues', 'weapon_legal']]
display(df_mental_issues)

# count cells equal to 1; both columns hold text labels such as 'Yes', so this prints 0
print((df_mental_issues == 1).sum())
print(df_mental_issues.isna().sum())

Unnamed: 0,mental_issues,weapon_legal
0,,Yes
1,,Yes
2,,Yes
3,,
4,,Yes
...,...,...
128,Yes,Yes
129,No,Yes
130,Yes,Yes
131,Yes,No


mental_issues    0
weapon_legal     0
dtype: int64
mental_issues    27
weapon_legal     14
dtype: int64


In [59]:
#count how many shooters who had known mental health issues also obtained a gun legally or not
df.value_counts([(df['mental_issues']== 1) & (df['weapon_legal'] != np.nan)])

False    133
dtype: int64

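There seems to be a discrepancy here, since no rows show as True. This does not actually tell us how the guns were obtained. The mental_issues column holds text labels such as "Yes" and "No", so the comparison `== 1` never matches, and `!= np.nan` is always true because NaN is not equal to anything. The count would need to compare against the text labels before it could show whether shooters with known mental health issues obtained their guns through legal channels.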
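
One way to get the intended count is to compare against the text labels and use `notna()` for the missing-value check, or to build a cross-tabulation of the two columns. The rows below are made up for illustration; the 'Unknown' filler label is an assumption.

```python
import pandas as pd

# Made-up rows using the same label style as the notebook's output
toy = pd.DataFrame({
    'mental_issues': ['Yes', 'No', 'Yes', None, 'Yes'],
    'weapon_legal': ['Yes', 'Yes', 'No', 'Yes', None],
})

# Shooters with known issues whose weapon_legal value is recorded
has_issues = toy['mental_issues'] == 'Yes'
legal_known = toy['weapon_legal'].notna()
print((has_issues & legal_known).sum())

# Full breakdown, with missing values shown under an explicit 'Unknown' label
table = pd.crosstab(
    toy['mental_issues'].fillna('Unknown'),
    toy['weapon_legal'].fillna('Unknown'),
)
print(table)
```

The cross-tabulation shows every combination at once, including how many rows are missing in each column.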

In [60]:
X.isna().sum()

city                     0
state                    0
date                     0
age_of_shooter           3
weapon_legal            14
weapon_type              3
location                 0
total_victims            0
gender_F                 0
gender_Female            0
gender_M                 0
gender_Male              0
gender_Male & Female     0
race_Asian               0
race_Black               0
race_Brown               0
race_Latino              0
race_Native American     0
race_Other               0
race_White               0
race_White               0
race_With                0
dtype: int64

In [65]:
#imputing race with most common race detected
df['race']=df['race'].fillna(df['race'].mode()[0])

**Building the Model - Logistic Regression**


In [66]:
log_reg = LogisticRegression(random_state=0)

log_reg_model = log_reg.fit(X_train, y_train)

ValueError: ignored
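
The error text above is elided, so its cause is not certain. Two likely candidates are that X_train still contains text columns (city, state, date, weapon_legal, weapon_type, location) and that y_train contains missing values, both of which LogisticRegression rejects. The sketch below shows one possible arrangement on a made-up frame: rows without a label are dropped, and a `ColumnTransformer` imputes and encodes numeric and text columns separately. The column groupings and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; 'issues' plays the role of the encoded target
toy = pd.DataFrame({
    'age_of_shooter': [20.0, 21.0, np.nan, 23.0, 45.0, 38.0, 52.0, 19.0],
    'total_victims': [5, 53, 3, 4, 14, 7, 9, 6],
    'location': ['Workplace', 'Other', 'Religious', 'Workplace',
                 'Workplace', 'School', 'Other', 'School'],
    'weapon_legal': ['Yes', 'Yes', 'Yes', 'No', 'Yes', np.nan, 'Yes', 'No'],
    'issues': [1, 0, 1, np.nan, 1, 0, 1, 0],
})

# Keep only rows with a known label, then separate features and target
labeled = toy.dropna(subset=['issues'])
X_demo = labeled.drop(columns='issues')
y_demo = labeled['issues'].astype(int)

numeric_cols = ['age_of_shooter', 'total_victims']
categorical_cols = ['location', 'weapon_legal']

# Numeric columns: mean-impute and scale; text columns: mode-impute and one-hot encode
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])

log_reg_pipe = Pipeline([('prep', preprocess),
                         ('model', LogisticRegression(random_state=0))])
log_reg_pipe.fit(X_demo, y_demo)
preds = log_reg_pipe.predict(X_demo)
print(preds)
```

Putting the preprocessing inside the pipeline means the same steps are applied to the validation and test sets when `predict` is called on them.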

In [69]:
# K-Nearest Neighbors - create pipeline

knn_pipe = Pipeline([('imp_mean',SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()), 
                     ('KNN', KNeighborsClassifier())])

knn_pipe.fit(X_train, y_train)

ValueError: ignored
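
The pipeline above uses a mean imputer and a scaler, which only work on numeric input, so passing text columns such as city or state is a plausible cause of the error (the message itself is elided). A minimal sketch, assuming made-up data and a target with no missing values, keeps only the numeric columns and scores the same pipeline with `cross_val_score`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: one text column and two numeric columns
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    'city': ['Anytown'] * n,  # text column the imputer and scaler cannot handle
    'age_of_shooter': rng.integers(15, 70, size=n).astype(float),
    'total_victims': rng.integers(3, 60, size=n),
})
toy.loc[[2, 7], 'age_of_shooter'] = np.nan  # a couple of missing ages
y_demo = pd.Series(np.arange(n) % 2)        # placeholder 0/1 labels

# Keep only numeric columns so the mean imputer and scaler can run
X_numeric = toy.select_dtypes(include='number')

knn_pipe = Pipeline([('imp_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
                     ('scaler', StandardScaler()),
                     ('KNN', KNeighborsClassifier())])

# 4-fold cross-validated accuracy
scores = cross_val_score(knn_pipe, X_numeric, y_demo, cv=4)
print(scores, scores.mean())
```

Because the labels here are placeholders, the scores carry no meaning; the point is only that the pipeline fits once the input is numeric and the target has no gaps.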

Unfortunately, I am unable to get my models to work.  This has been a constant struggle for me since our machine learning class.  I clearly struggle with the splitting or naming things so that they will pull the data properly.  

Please see other notebook for original graphs that were done in the data exploration process.