# Introduction

This project involves analyzing a provided dataset that contains information about the voting behavior of various counties in the United States. The goal is to use classification methods to predict whether a county will vote "yes" or "no" to legalizing gaming through a ballot.

The target variable is coded as 0 = No and 1 = Yes.

# Data Preprocessing

In the data preprocessing step I took time to get a feel for the data and decide which columns I did and didn't want in my dataframe.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

In [None]:
# reading the data
df = pd.read_csv("/content/drive/MyDrive/IS470_Data/Gaming Ballot Data Set-1.csv")
df

In [None]:
df.keys()

In [None]:
df.dtypes

In [None]:
#Select the desired columns only
desired_columns = ['State No','DEPENDENT VARIABLE', 'BALLOT TYPE', 'POPULATION', 'PCI',
       'MEDIUM FAMILY INCOME', 'POPULATION DENSITY','PERCENT WHITE', 'PERCENT BLACK', 'PERCENT OTHER', 'PERCENT MALE' , 'POVERTY LEVEL'
       ,'UNEMPLOYMENT RATE','AGE LESS THAN 18', 'AGE24', 'AGE44', 'AGE64','AGE OLDER THAN 65', 'MSA']

gaming_desired = df[desired_columns]

In [None]:
df = gaming_desired.copy()
# Recode the target from 0/1 to No/Yes (plain assignment lets the column change dtype cleanly)
df['DEPENDENT VARIABLE'] = df['DEPENDENT VARIABLE'].replace({0: 'No', 1: 'Yes'})

In [None]:
df.head(10)

In [None]:
# In this part I removed the $ and the , so I won't run into errors in the future.
df['MEDIUM FAMILY INCOME'] = df['MEDIUM FAMILY INCOME'].replace(r'\$', '', regex=True).replace(',', '', regex=True)
# Convert the column to int64 type
df['MEDIUM FAMILY INCOME'] = df['MEDIUM FAMILY INCOME'].astype('int64')

In [None]:
# I did the same thing here and switched the column to int64
df['PCI'] = df['PCI'].replace(r'\$', '', regex=True).replace(',', '', regex=True)

df['PCI'] = df['PCI'].astype('int64')
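
The same cleaning is applied to two columns, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `currency_to_int` and the sample values are hypothetical and are not taken from the ballot data set.

```python
import pandas as pd


def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a string column and convert it to int64."""
    cleaned = series.astype(str).str.replace(r"[\$,]", "", regex=True)
    return cleaned.astype("int64")


# Hypothetical sample values that mimic the formatting of the real columns
sample = pd.DataFrame({
    "PCI": ["$12,345", "$9,870"],
    "MEDIUM FAMILY INCOME": ["$41,000", "$38,250"],
})

for col in ["PCI", "MEDIUM FAMILY INCOME"]:
    sample[col] = currency_to_int(sample[col])

print(sample.dtypes)
```

Keeping the logic in one place means a formatting change in the source file only has to be handled once.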

In [None]:
# Examine missing values
df.isnull().sum()

In [None]:
df['DEPENDENT VARIABLE'] = df['DEPENDENT VARIABLE'].astype('category')
df['BALLOT TYPE'] = df['BALLOT TYPE'].astype('category')
df['MSA'] = df['MSA'].astype('category')

In [None]:
df.dtypes

In [None]:
# Display all numeric variables
df.select_dtypes(include=['number'])

In [None]:
# Display all category variables
df.select_dtypes(include=['category'])

In [None]:
# I wanted to see whether poverty level had an impact on the vote: 42 rows with a poverty level above 30 voted Yes.
df[(df['POVERTY LEVEL'] > 30) & (df['DEPENDENT VARIABLE']== 'Yes')]

In [None]:
# By comparison, 26 rows with a poverty level above 30 voted No.
df[(df['POVERTY LEVEL'] > 30) & (df['DEPENDENT VARIABLE']== 'No')]
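
Raw row counts are easier to compare once they sit side by side with the rows at or below the threshold. A minimal sketch of that comparison using `pd.crosstab` follows; the toy frame and the band labels are made up for illustration, and in the notebook `df` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data; column names match the notebook, values do not
toy = pd.DataFrame({
    "POVERTY LEVEL": [35.0, 12.0, 31.5, 8.0, 40.2, 25.0],
    "DEPENDENT VARIABLE": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# Label each row by whether its poverty level exceeds the threshold of 30
band = toy["POVERTY LEVEL"].gt(30).map({True: "above 30", False: "30 or below"})

# Counts of Yes/No votes per band, and the same table as row-wise shares
counts = pd.crosstab(band, toy["DEPENDENT VARIABLE"])
shares = pd.crosstab(band, toy["DEPENDENT VARIABLE"], normalize="index")

print(counts)
print(shares.round(2))
```

The row-wise shares show whether the Yes rate actually differs between the two bands, which the two separate filters cannot show on their own.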

In [None]:
# Obtain the variance, standard deviation, and range of a numeric variable: MEDIUM FAMILY INCOME
print("variance: ", df['MEDIUM FAMILY INCOME'].var(), "standard deviation: ", df['MEDIUM FAMILY INCOME'].std(), "range: ", df['MEDIUM FAMILY INCOME'].min(), df['MEDIUM FAMILY INCOME'].max())

In [None]:
df['MEDIUM FAMILY INCOME'].describe()

In the data preprocessing phase of my project, I focused on transforming data types and conducting analyses on poverty levels. By converting data types and exploring variations in poverty levels, I gained crucial insights that guided my approach for the remainder of the project.

# Data Visualization

In this section I will create visualizations to gain a better understanding of the data, looking for trends or anything that catches my eye and warrants further examination.

In [None]:
# Boxplot of a numeric variable: MEDIUM FAMILY INCOME
snsplot = sns.boxplot(x='MEDIUM FAMILY INCOME', data = df)
snsplot.set_title("Boxplot of MEDIUM FAMILY INCOME")

In [None]:
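# In this visualization I wanted to see how a few variables relate to my target variable.
# .corr() needs numeric columns, so the target is encoded as 0/1 first;
# BALLOT TYPE is categorical and is left out by numeric_only=True.
corr_columns = ['DEPENDENT VARIABLE', 'BALLOT TYPE', 'POPULATION', 'PCI',
                'PERCENT MALE', 'POVERTY LEVEL', 'UNEMPLOYMENT RATE', 'AGE24',
                'AGE44']
corr_df = df[corr_columns].copy()
corr_df['DEPENDENT VARIABLE'] = (corr_df['DEPENDENT VARIABLE'] == 'Yes').astype(int)
correlation_matrix = corr_df.corr(numeric_only=True)

sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation matrix for numeric features')
plt.xlabel('Features')
plt.ylabel('Features')
plt.show()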

In [None]:
# This shows that wagering ballots appear more often than gambling ballots.
snsplot = sns.countplot(x='BALLOT TYPE', data=df)
snsplot.set_title("Count of ballots by type (wagering vs. gambling)")

In [None]:
sns.boxplot(x='POVERTY LEVEL', y='MSA', data=df)
plt.xlabel('POVERTY LEVEL')
plt.ylabel('MSA')
plt.title('Boxplot of poverty level by MSA')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Histogram of a numeric variable: Unemployment Rate
snsplot = sns.histplot(x='UNEMPLOYMENT RATE', data = df)
snsplot.set_title("Histogram of Unemployment rate in data set")

In the data visualization section of my project, I employed various techniques to gain insights into the data. This included using heat maps, histograms specifically focusing on the unemployment rate, and box plots. These visualizations helped me analyze patterns, distributions, and correlations within the data, providing valuable insights for further analysis and model development.

# Model development

In this section I will develop three kinds of models: a decision tree, Naive Bayes, and K Nearest Neighbor. I will analyze them to see which perform better or worse, and I will elaborate at the end.

## Decision Tree model




In [None]:
df = pd.get_dummies(df, columns=[ 'BALLOT TYPE', 'MSA'], drop_first=True)
df

In [None]:
# Examine the proportion of the target variable in the data set
target = df['DEPENDENT VARIABLE']
print(target.value_counts(normalize=True))

In [None]:
# Partition the data
predictors = df.drop(['DEPENDENT VARIABLE'],axis=1)
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train.shape, predictors_test.shape, target_train.shape, target_test.shape)
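
The split above is purely random, so the Yes/No mix in the test set can drift from the mix in the full data. One option, shown here only as a sketch on hypothetical toy data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both partitions. This is not what the notebook does, and switching to it would change the reported results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 70 'No' rows and 30 'Yes' rows
toy_X = pd.DataFrame({"x": range(100)})
toy_y = pd.Series(["No"] * 70 + ["Yes"] * 30)

# stratify=toy_y preserves the 70/30 class mix in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```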

In [None]:
# Combine predictors_train and target_train into a single DataFrame
combined_train_df = pd.concat([predictors_train, target_train], axis=1)

# Separate majority and minority classes
majority_df = combined_train_df[combined_train_df['DEPENDENT VARIABLE'] == 'No']
minority_df = combined_train_df[combined_train_df['DEPENDENT VARIABLE'] == 'Yes']

# Undersample the majority class randomly
undersampled_majority = majority_df.sample(n=len(minority_df), random_state=5)

# Combine the undersampled majority class and the minority class
undersampled_data = pd.concat([undersampled_majority, minority_df])

# Shuffle the combined DataFrame to ensure randomness
balanced_data = undersampled_data.sample(frac=1, random_state=5)

# Split the balanced_data into predictors_train and target_train
predictors_train = balanced_data.drop(columns=['DEPENDENT VARIABLE'])
target_train = balanced_data['DEPENDENT VARIABLE']

print(target_train.value_counts(normalize=True), target_train.shape)

# Now the data is balanced!!
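
The cell above hard-codes 'No' as the majority class. A more general variant, sketched below under the assumption that every class should be cut down to the size of the smallest one, finds that size from the data. The helper name `undersample` and the toy frame are hypothetical.

```python
import pandas as pd


def undersample(frame: pd.DataFrame, target_col: str, seed: int = 5) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class, then shuffle."""
    class_counts = frame[target_col].value_counts()
    # Ignore categories that never occur (possible with a categorical dtype)
    smallest = class_counts[class_counts > 0].min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(target_col, observed=True)
    ]
    # Shuffle so the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Hypothetical toy data: 7 'No' rows and 3 'Yes' rows
toy = pd.DataFrame({
    "x": range(10),
    "DEPENDENT VARIABLE": ["No"] * 7 + ["Yes"] * 3,
})
balanced = undersample(toy, "DEPENDENT VARIABLE")
print(balanced["DEPENDENT VARIABLE"].value_counts())
```

As in the notebook, undersampling should only be applied to the training partition so that the test set keeps its natural class mix.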

In [None]:
print(target_test.value_counts(normalize=True))

### Decision Tree depth of 3

In [None]:
# Build a decision tree model on training data with max_depth = 3

model = DecisionTreeClassifier(criterion = "entropy", random_state=1, max_depth = 3)

model.fit(predictors_train, target_train)

In [None]:
# plotting the tree

fig = plt.figure(figsize=(40,30))
tree.plot_tree(model,
               feature_names = list(predictors_train.columns),
               class_names=['No','Yes'],
               filled=True)

In [None]:
# Now I will have the model make a prediction on test data.

prediction_on_test = model.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score

print(classification_report(target_test, prediction_on_test))
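
To make the numbers in the classification report easier to interpret, the sketch below derives accuracy, precision, and recall for the Yes class directly from a confusion matrix. The labels are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]
y_pred = ["No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm_toy = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = cm_toy.ravel()

precision_yes = tp / (tp + fp)       # of the rows predicted Yes, the share that were Yes
recall_yes = tp / (tp + fn)          # of the rows that were Yes, the share that were found
accuracy = (tp + tn) / cm_toy.sum()  # share of all rows predicted correctly

print(precision_yes, recall_yes, accuracy)
```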


### Decision Tree depth of 5

In [None]:
# Build a decision tree model on training data with max_depth = 5

model2 = DecisionTreeClassifier(criterion = "entropy", random_state=1, max_depth = 5)

model2.fit(predictors_train, target_train)

In [None]:
# plotting the tree

fig = plt.figure(figsize=(40,30))
tree.plot_tree(model2,
               feature_names = list(predictors_train.columns),
               class_names=['No','Yes'],
               filled=True)

In [None]:
# Now I will have the second model (model2) make a prediction on test data.

prediction_on_test = model2.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model2.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score

print(classification_report(target_test, prediction_on_test))

Comparing the two trees, increasing the depth from 3 to 5 improved accuracy by 8%, to 86%, which is a large gain for only two additional levels. Precision dropped slightly from model 1 (88) to model 2 (86); however, precision on No increased by 13%, which suggests the deeper tree is better at telling Yes and No votes apart. Recall decreased for No but increased for Yes, so the deeper tree catches more of the Yes votes at the cost of missing some No votes.
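
A comparison based on a single test split can be sensitive to how the rows happened to be partitioned. One common way to check a depth choice is k-fold cross-validation, sketched here on synthetic stand-in data from `make_classification`. The data, the extra depth of 7, and the variable names are assumptions for illustration; in the notebook `predictors_train` and `target_train` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the ballot data set
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

depth_scores = {}
for depth in (3, 5, 7):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=1)
    # Mean accuracy over 5 folds for this depth
    depth_scores[depth] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")
```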

## Naive Bayes model prediction

In [None]:
# Building a Naive Bayes model on training data
model_NB = MultinomialNB()
model_NB.fit(predictors_train, target_train)

In [None]:
# Make predictions on testing data
prediction_on_test_NB = model_NB.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm_NB = confusion_matrix(target_test, prediction_on_test_NB)
ConfusionMatrixDisplay(confusion_matrix=cm_NB, display_labels=model_NB.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score.
print(classification_report(target_test, prediction_on_test_NB))


The Naive Bayes model achieved an accuracy of 0.56, lower than the Decision Tree model. It showed good recall for class No (0.84), but recall for class Yes fell short (0.23). The F1-score for class Yes was 0.32, indicating a need for a better balance between precision and recall. This means the model needs refining for better performance.
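
One possible refinement, offered as an assumption rather than a tested fix for this data set, is to try a different Naive Bayes variant. `MultinomialNB` is designed for count-like features, while most predictors here are continuous (income, rates, percentages). `GaussianNB` models each feature with a per-class normal distribution and is the usual scikit-learn choice for continuous inputs. The sketch uses synthetic data, so its scores say nothing about the ballot data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with continuous features
X_nb, y_nb = make_classification(n_samples=300, n_features=8, random_state=0)
X_nb_tr, X_nb_te, y_nb_tr, y_nb_te = train_test_split(
    X_nb, y_nb, test_size=0.3, random_state=0
)

# Swapped-in technique: Gaussian instead of multinomial Naive Bayes
gnb = GaussianNB().fit(X_nb_tr, y_nb_tr)
nb_pred = gnb.predict(X_nb_te)

print(classification_report(y_nb_te, nb_pred))
```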

## K Nearest Neighbor model

### n_neighbors = 1

In [None]:
# Apply min-max normalization to the predictors.
# Note: the KNN models below are fit on predictors_train, not on this normalized copy.
min_max_scaler = preprocessing.MinMaxScaler()
predictors_normalized = pd.DataFrame(min_max_scaler.fit_transform(predictors),
                                     columns=predictors.columns, index=predictors.index)
predictors_normalized
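
KNN relies on distances, so features on large scales (such as population) can dominate features on small scales (such as percentages). A common way to apply the normalization consistently is to put the scaler and the classifier in one pipeline, so the scaler is fit on the training rows only and the same transformation is reused on the test rows. The sketch below uses synthetic data and assumed variable names. It is not the code that produced the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; one feature is inflated to mimic a large-scale column
X_knn, y_knn = make_classification(n_samples=300, n_features=8, random_state=0)
X_knn[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(X_knn, y_knn, test_size=0.3, random_state=0)

# The scaler learns min/max from the training rows only, then KNN sees scaled values
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
knn_scaled.fit(X_tr, y_tr)

fitted_scaler = knn_scaled.named_steps["minmaxscaler"]
test_accuracy = knn_scaled.score(X_te, y_te)
print(f"test accuracy with scaling: {test_accuracy:.3f}")
```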

In [None]:
# Build a K Nearest Neighbor model on training data with n_neighbors = 1
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(predictors_train, target_train)

In [None]:
# Make predictions on training and testing data
prediction_on_train = model.predict(predictors_train)
prediction_on_test = model.predict(predictors_test)

In [None]:
cm = confusion_matrix(target_train, prediction_on_train)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_).plot()

In [None]:
# Examine the evaluation results on training data: accuracy, precision, recall, and f1-score
print(classification_report(target_train, prediction_on_train))

# With k = 1 I expected a perfect score on training data, because each training point is essentially its own nearest neighbor.

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score
print(classification_report(target_test, prediction_on_test))
# However, it does poorly on test data, which is what matters most.

### n_neighbors = 3

In [None]:
# Build a K Nearest Neighbor model on training data with n_neighbors = 3
model2 = KNeighborsClassifier(n_neighbors = 3)
model2.fit(predictors_train, target_train)

In [None]:
# Make predictions on training and testing data
prediction_on_train = model2.predict(predictors_train)
prediction_on_test = model2.predict(predictors_test)

In [None]:
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model2.classes_).plot()

In [None]:
# Examine the evaluation results on test data: accuracy, precision, recall, and f1-score
print(classification_report(target_test, prediction_on_test ))

For my KNN model I again received lower accuracy than with my decision tree; however, it improved on my Naive Bayes model. All of my metrics increased when I raised k to 3, but they seemed to drop when I increased k further. So there is still some tweaking to work out.
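
Rather than trying values of k by hand, the search can be automated. The sketch below uses `GridSearchCV` with 5-fold cross-validation over a hypothetical list of candidate k values on synthetic data. The candidate list, data, and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the ballot data set
X_k, y_k = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical candidate values for k
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_k, y_k)

print(search.best_params_, round(search.best_score_, 3))
```

Because the score for each k is averaged over several folds, the chosen value is less dependent on one particular train/test split.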

# Results and Model Evaluation

Out of all the machine learning models I tested, the decision tree emerged as the star player, showcasing its ability to uncover intricate data patterns with an accuracy of about 85%. It also consistently scored 80% or higher on recall, precision, and F1-score. The same cannot be said for the other models. My Naive Bayes model struggled, yielding only 56% accuracy and a 23% recall for Yes votes. Lastly, the KNN model showed some improvement, but not enough to celebrate: it reached an accuracy of 60% after increasing k to 3. Although its precision on No votes was better than on Yes votes, more work is needed to improve this model.

# Conclusion
In conclusion, this project delved into machine learning algorithms to predict voting behavior. The Decision Tree model emerged as the standout performer, showcasing high accuracy and strength in capturing intricate data patterns. On the other hand, Naive Bayes struggled with accuracy and recall, highlighting its limitations. The K-Nearest Neighbors model showed promise after tuning but requires further refinement. Overall, this project emphasizes the importance of selecting the right algorithm for specific data contexts and objectives, providing valuable insights for future predictive modeling endeavors.

In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/IncremonaBrandonUse.ipynb"