# **Gun Crime analysis**

Hypothesis 1: Suicide rates are higher for individuals with lower education levels.

Rationale: Socio-economic factors such as a lack of resources and opportunities are often linked to mental health and stress levels.


## Objectives

1. **Load the Dataset**: Import the necessary CSV files into dataframes.
2. **Explore the Data**: Perform initial exploration to understand the structure and content of the data.
3. **Data Cleaning and Refinement**: Clean the data by handling missing values, correcting data types, and refining the dataset for analysis.
4. **Surface Level Analysis**: Conduct preliminary analysis to identify trends, patterns, and key statistics.
5. **Basic Visualizations**: Create visual representations of the data to aid in understanding and communicating findings.

## Inputs

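* `gun_deaths.csv`, read from the project root directory

## Outputs

* `cleaned_gun_deaths.csv`, the cleaned dataset saved to the project root directory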

## Additional Comments

* 



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory needs to be changed when running the notebook in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Compu\\Crime-Data\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Compu\\Crime-Data'

In [4]:
import os

# Check if the file exists in the current directory
file_path = os.path.join(current_dir, 'gun_deaths.csv')
if os.path.exists(file_path):
    print("File exists")
else:
    print("File does not exist")

File exists


# Section 1

## Data Exploration, Cleaning, and Refinement

### Data Exploration
Data exploration is the first step: inspecting the dataset's structure and how it is presented. This step includes:
1. **Loading the Dataset**
2. **Initial Exploration**
3. **Summary Statistics**
4. **Identifying Missing Values**
5. **Identifying Duplicate Rows**

These steps are important because undetected missing values or duplicate rows could distort the analysis and therefore affect our findings.

### Data Cleaning
Data cleaning involves handling issues identified during exploration to ensure the dataset is accurate and reliable. This step includes:
1. **Handling Missing Values**
2. **Removing Duplicate Rows**
3. **Correcting Data Types**

### Data Refinement
Data refinement involves further processing the cleaned dataset to prepare it for analysis. This step includes:
1. **Feature Engineering**
2. **Normalization and Scaling**
3. **Encoding Categorical Variables**

## Ethical Considerations

### Privacy and Confidentiality
The dataset contains sensitive information about individuals who have died due to gun-related incidents. It is crucial to ensure that the data is anonymized and does not contain any personally identifiable information (PII). In this dataset, all personal identifiers have been removed; each record holds only the year, month, intent, police involvement, place, and demographic attributes (sex, age, race, education) of a death.

### Data Accuracy and Integrity
Ensuring the accuracy and integrity of the data is essential to avoid misleading conclusions. During the data cleaning process, we handled missing values, corrected data types, and removed duplicate rows to maintain the dataset's reliability.

### Handling Outliers
Outliers can significantly impact the results of the analysis. In this dataset, outliers in the age column were identified using Z-scores. However, since these outliers represent real ages of victims, they were retained to avoid skewing the results and to maintain the integrity of the data.

### Bias and Fairness
It is important to recognize and address any potential biases in the dataset. For instance, the dataset may have inherent biases based on the demographic distribution of the data. To mitigate this, we conducted a thorough exploration and cleaning process to ensure that the analysis is as unbiased as possible.

### Ethical Reporting
When reporting the findings, it is essential to present the results transparently and avoid any manipulation of data to support a particular hypothesis. The analysis and visualizations are conducted objectively, and the results are reported accurately.

### Overcoming Ethical Issues
1. **Anonymization**: Ensured that the dataset does not contain any PII.
2. **Data Cleaning**: Handled missing values, corrected data types, and removed duplicates to maintain data integrity.
3. **Outlier Handling**: Retained outliers that represent real data to avoid skewing results.
4. **Bias Mitigation**: Conducted thorough data exploration and cleaning to minimize biases.
5. **Transparent Reporting**: Presented findings objectively and accurately without manipulating data.

By addressing these ethical considerations, we aim to conduct a responsible and unbiased analysis of the dataset.


In [None]:

# Importing the required libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import streamlit as st


These libraries provide the data handling (pandas, NumPy), plotting (Matplotlib, seaborn), and statistical testing (SciPy, statsmodels) that the analysis relies on.

In [6]:
# Load the dataset
data = pd.read_csv('gun_deaths.csv')

# Display the first 5 rows of the dataset
data.head()


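| | year | month | intent | police | sex | age | race | place | education |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012 | 1 | Suicide | 0 | M | 34.0 | Asian/Pacific Islander | Home | BA+ |
| 1 | 2012 | 1 | Suicide | 0 | F | 21.0 | White | Street | Some college |
| 2 | 2012 | 1 | Suicide | 0 | M | 60.0 | White | Other specified | BA+ |
| 3 | 2012 | 2 | Suicide | 0 | M | 64.0 | White | Home | BA+ |
| 4 | 2012 | 2 | Suicide | 0 | M | 31.0 | White | Other specified | HS/GED |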


In [7]:
# Get a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100798 entries, 0 to 100797
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   year       100798 non-null  int64  
 1   month      100798 non-null  int64  
 2   intent     100797 non-null  object 
 3   police     100798 non-null  int64  
 4   sex        100798 non-null  object 
 5   age        100780 non-null  float64
 6   race       100798 non-null  object 
 7   place      99414 non-null   object 
 8   education  99376 non-null   object 
dtypes: float64(1), int64(3), object(5)
memory usage: 6.9+ MB


In [None]:
# Get descriptive statistics for the numerical columns
data.describe()


| | year | month | police | age |
|---|---|---|---|---|
| count | 100798.0 | 100798.0 | 100798.0 | 100780.0 |
| mean | 2013.000357 | 6.567601 | 0.013909 | 43.857601 |
| std | 0.816278 | 3.405609 | 0.117114 | 19.496181 |
| min | 2012.0 | 1.0 | 0.0 | 0.0 |
| 25% | 2012.0 | 4.0 | 0.0 | 27.0 |
| 50% | 2013.0 | 7.0 | 0.0 | 42.0 |
| 75% | 2014.0 | 9.0 | 0.0 | 58.0 |
| max | 2014.0 | 12.0 | 1.0 | 107.0 |


In [9]:
# Check for missing values, as they may affect the analysis
missing_values = data.isnull().sum()
missing_values

year            0
month           0
intent          1
police          0
sex             0
age            18
race            0
place        1384
education    1422
dtype: int64

There are missing values in the intent (1), age (18), place (1,384), and education (1,422) columns. Because these could affect the analysis, we will drop the rows that contain them.
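
To judge how much data would be lost by dropping rows, the counts above can also be expressed as a share of all rows. The sketch below is illustrative only: `sample` and its values are made up rather than taken from `gun_deaths.csv`, and the same two expressions could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "age": [34.0, np.nan, 60.0, 64.0],
    "place": ["Home", "Street", None, None],
    "education": ["BA+", "HS/GED", "BA+", "Some college"],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100

# Share of rows that contain at least one missing value
rows_lost_pct = sample.isnull().any(axis=1).mean() * 100

print(missing_pct.round(1))
print(f"Rows lost if dropped: {rows_lost_pct:.1f}%")
```

If the share of affected rows is small, dropping them is unlikely to change the overall picture; a large share would be a reason to consider other strategies.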

In [10]:
# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows

39227

The code above shows that there are 39,227 duplicated rows in the dataset, which is a serious issue. We must therefore remove the duplicates during data cleaning so that the hypothesis testing gives accurate results.
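
Before dropping anything, it can help to look at which rows are flagged and how often each combination repeats. The sketch below uses a small made-up frame (`sample` is a hypothetical stand-in, not real data). Because the columns listed by `data.info()` include no record identifier, a view like this may help judge whether repeated rows are copies or separate records that happen to share the same attributes.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "year": [2012, 2012, 2012, 2013],
    "intent": ["Suicide", "Suicide", "Suicide", "Suicide"],
    "age": [34.0, 34.0, 34.0, 60.0],
    "education": ["BA+", "BA+", "BA+", "HS/GED"],
})

# keep=False marks every member of a duplicated group, not just the repeats
all_copies = sample[sample.duplicated(keep=False)]

# Count how often each distinct combination of values occurs
# (dropna=False keeps groups whose key columns contain missing values)
group_sizes = (
    all_copies.groupby(list(sample.columns), dropna=False)
    .size()
    .sort_values(ascending=False)
)

print(group_sizes)
```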

In [42]:
# Calculate the Z-scores for the numerical columns
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))

# Define a threshold for identifying outliers
threshold = 2.5

outliers = data[(z_scores > threshold).any(axis=1)]

# Display the rows flagged as outliers
print(outliers)  # Explicitly print the DataFrame


        year month   intent police sex    age   race              place  \
8655    2012    11  Suicide      0   M   99.0  White               Home   
18824   2012     2  Suicide      0   M  102.0  Black               Home   
25306   2012     9  Suicide      0   M   99.0  White  Other unspecified   
68005   2014    11  Suicide      0   M  101.0  White               Home   
70663   2014     5  Suicide      0   M  102.0  White               Home   
77776   2014     9  Suicide      0   M   99.0  White               Home   
84254   2014     6  Suicide      0   M   99.0  White               Home   
86719   2014    11  Suicide      0   F  101.0  Black  Other unspecified   
100720  2014     9  Suicide      0   M   99.0  White               Home   

           education  
8655             BA+  
18824            BA+  
25306         HS/GED  
68005   Less than HS  
70663         HS/GED  
77776   Some college  
84254         HS/GED  
86719   Less than HS  
100720  Less than HS  



After computing Z-scores, the outliers detected are in the age column. There are three main ways to address this:

1. **Drop the outliers**
2. **Keep the outliers**
3. **Replace the values with a statistical method using the mean, mode, or median**

However, the outliers will be kept because they are real data: the outlier ages are the actual ages of victims of gun-related deaths. Removing them would also skew the results, and deliberately removing real victims' ages because they lie far from the mean could be perceived as a form of data manipulation.
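
As a minimal sketch of the Z-score check on a single column, the example below flags ages more than 2.5 standard deviations from the mean while ignoring missing values. The `ages` values are made up for illustration. The `nan_policy="omit"` argument is a choice made for this sketch so that one missing age does not turn every Z-score into NaN; it is not necessarily how the cell above was run.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical ages (illustrative only), including one missing value
ages = pd.Series([21.0, 34.0, 42.0, 58.0, 60.0, 31.0, 45.0, np.nan,
                  27.0, 38.0, 50.0, 29.0, 64.0, 47.0, 36.0, 102.0])

# nan_policy="omit" leaves NaN out of the mean and standard deviation
z = np.abs(stats.zscore(ages.to_numpy(), nan_policy="omit"))

threshold = 2.5
# NaN compares as False, so a missing age is never flagged as an outlier
is_outlier = z > threshold

print(ages[is_outlier])
```

The flagged rows can then be inspected and kept, as decided above, rather than removed automatically.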

## Begin Data Cleaning and Refinement


In [12]:
# Drop duplicate rows in the dataset
data = data.drop_duplicates()

# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows


0

Acting on the findings from the data exploration, the duplicated rows have been dropped.

In [13]:
# Drop rows with missing values in the 'intent', 'age', 'place', or 'education' columns
data = data.dropna(subset=['intent', 'age', 'place', 'education'])



The rows with missing values have been dropped.

In [14]:
# Correcting data types
data['year'] = data['year'].astype('category')
data['month'] = data['month'].astype('category')
data['police'] = data['police'].astype('category')
data['sex'] = data['sex'].astype('category')
data['race'] = data['race'].astype('category')
data['place'] = data['place'].astype('category')
data['education'] = data['education'].astype('category')
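
Because the hypothesis concerns lower versus higher education levels, it may be useful for the education category to carry an explicit order. The sketch below assumes the four levels seen in the outputs above and an ordering from lowest to highest; both the ordering and the `education` sample values are assumptions for illustration. Any value not listed in `education_order` would become missing after the conversion.

```python
import pandas as pd

# Assumed ordering, lowest to highest, of the four levels seen in the outputs
education_order = ["Less than HS", "HS/GED", "Some college", "BA+"]

# Hypothetical values (illustrative only)
education = pd.Series(["BA+", "Some college", "HS/GED", "Less than HS", "BA+"])

# An ordered categorical sorts by level rather than alphabetically
# and supports comparisons such as "at most HS/GED"
education = education.astype(
    pd.CategoricalDtype(categories=education_order, ordered=True)
)

print(education.sort_values().tolist())
print((education <= "HS/GED").sum())
```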


In [None]:
# Double-check if the data types have been corrected
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58927 entries, 0 to 100797
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   year       58927 non-null  category
 1   month      58927 non-null  category
 2   intent     58927 non-null  object  
 3   police     58927 non-null  category
 4   sex        58927 non-null  category
 5   age        58927 non-null  float64 
 6   race       58927 non-null  category
 7   place      58927 non-null  category
 8   education  58927 non-null  category
dtypes: category(7), float64(1), object(1)
memory usage: 1.7+ MB


In [16]:
# Check if data is cleaned
missing_values = data.isnull().sum()
print(missing_values)

year         0
month        0
intent       0
police       0
sex          0
age          0
race         0
place        0
education    0
dtype: int64


In [23]:
# Drop homicide rows from the dataset as our hypothesis is not about homicide
data = data[data['intent'] != 'Homicide']

# Check first 5 rows of the dataset
data.head()

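| | year | month | intent | police | sex | age | race | place | education |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012 | 1 | Suicide | 0 | M | 34.0 | Asian/Pacific Islander | Home | BA+ |
| 1 | 2012 | 1 | Suicide | 0 | F | 21.0 | White | Street | Some college |
| 2 | 2012 | 1 | Suicide | 0 | M | 60.0 | White | Other specified | BA+ |
| 3 | 2012 | 2 | Suicide | 0 | M | 64.0 | White | Home | BA+ |
| 4 | 2012 | 2 | Suicide | 0 | M | 31.0 | White | Other specified | HS/GED |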


Data cleaning is now complete, and I can proceed to test and analyse the data.

In [28]:
# Save the cleaned dataset without the index, so that reloading it
# does not add an extra unnamed column, and proceed to the next step
data.to_csv('cleaned_gun_deaths.csv', index=False)

---

# Section 2: Hypothesis Testing

Now that the data has been explored, cleaned, and refined, we can begin testing our hypothesis.

Hypothesis 1: Suicide rates are higher for individuals with lower education levels.

Rationale: Studies such as 'A Study of Suicide and Socioeconomic Factors' have analysed data from the G7 countries and concluded that people from low-income families are at higher risk of suicide. The aim of this analysis is to determine whether a similar trend holds in the United States.

To test this, I will conduct a chi-squared test for any significant variation in suicide across education levels and then visualise the results.

There will be two hypotheses: a null and an alternative. The null hypothesis states that there is no significant difference in suicide rates across education levels, while the alternative hypothesis states that there is.
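
A chi-squared test of independence along these lines could be sketched as follows. The records are made up for illustration and are not drawn from `gun_deaths.csv`. The "Other" label is a hypothetical grouping of all non-suicide intents, and the 0.05 significance level is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical records (illustrative only)
sample = pd.DataFrame({
    "education": (
        ["Less than HS"] * 40 + ["HS/GED"] * 60
        + ["Some college"] * 50 + ["BA+"] * 50
    ),
    "intent": (
        ["Suicide"] * 25 + ["Other"] * 15      # Less than HS
        + ["Suicide"] * 40 + ["Other"] * 20    # HS/GED
        + ["Suicide"] * 30 + ["Other"] * 20    # Some college
        + ["Suicide"] * 35 + ["Other"] * 15    # BA+
    ),
})

# Contingency table: rows are education levels, columns are intent categories
observed = pd.crosstab(sample["education"], sample["intent"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
print(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

The test needs at least two intent categories to compare against, which is why the sketch keeps a non-suicide group alongside the suicide counts.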



In [44]:
# Load the cleaned dataset
data = pd.read_csv('cleaned_gun_deaths.csv')
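
CSV files store plain text, so the category data types set during cleaning are not kept when the file is read back. One way to restore them, sketched below, is to request them again through the `dtype` argument of `pd.read_csv`. The CSV content and the choice of columns in `category_columns` are illustrative assumptions; the real file would be read from `cleaned_gun_deaths.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the cleaned file (illustrative only)
csv_text = (
    "year,sex,age,education\n"
    "2012,M,34.0,BA+\n"
    "2012,F,21.0,Some college\n"
)

# Example subset of columns to read back as categories
category_columns = ["sex", "education"]

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    dtype={column: "category" for column in category_columns},
)

print(reloaded.dtypes)
```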

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [18]:
import os
try:
  # Create a new folder in the current directory
  os.makedirs(os.path.join(current_dir, 'new_folder'))
except Exception as e:
  print(e)


[WinError 183] Cannot create a file when that file already exists: 'c:\\Users\\Compu\\Crime-Data\\new_folder'
