# **Gun Crime analysis**

Hypothethis: Hypothesis 1: Suicide rates are higher among individuals with lower educational attainment.

Rationale: Socioeconomic factors, including education level, are often correlated with mental health and access to resources. Lower education may indicate increased stress and limited opportunities, potentially contributing to higher suicide rates.

Analysis: You could compare the suicide rates across different education levels. Calculate the percentage of gun deaths that are suicides for each education category (e.g., "Less than HS", "HS/GED", "BA+"). Use statistical tests (e.g., chi-squared test) to see if the differences are statistically significant.

## Objectives

1. **Load the Dataset**: Import the necessary CSV files into dataframes.
2. **Explore the Data**: Perform initial exploration to understand the structure and content of the data.
3. **Data Cleaning and Refinement**: Clean the data by handling missing values, correcting data types, and refining the dataset for analysis.
4. **Surface Level Analysis**: Conduct preliminary analysis to identify trends, patterns, and key statistics.
5. **Basic Visualizations**: Create visual representations of the data to aid in understanding and communicating findings.

## Inputs

* 
## Outputs

* 

## Additional Comments

* 



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Compu\\Crime-Data\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Compu\\Crime-Data'

In [4]:
import os

# Check if the file exists in the current directory
file_path = os.path.join(current_dir, 'gun_deaths.csv')
if os.path.exists(file_path):
    print("File exists")
else:
    print("File does not exist")

File exists


# Section 1

## Data Exploration, Cleaning, and Refinement

### Data Exploration
Data exploration involves examining the dataset to understand its structure, content, and key characteristics. This step includes:
1. **Loading the Dataset**: Importing the dataset into a DataFrame.
2. **Initial Exploration**: Displaying the first few rows to get a sense of the data.
3. **Summary Statistics**: Using methods like `info()` and `describe()` to get an overview of the dataset, including data types, missing values, and basic statistical measures.
4. **Identifying Missing Values**: Checking for missing values in the dataset, which can affect analysis and need to be addressed.
5. **Identifying Duplicate Rows**: Checking for duplicate rows that can inflate the dataset and skew analysis results.

### Data Cleaning
Data cleaning involves handling issues identified during exploration to ensure the dataset is accurate and reliable. This step includes:
1. **Handling Missing Values**: Addressing missing values by either filling them with appropriate values (e.g., mean, median) or removing rows/columns with excessive missing data.
2. **Removing Duplicate Rows**: Dropping duplicate rows to ensure each entry in the dataset is unique.
3. **Correcting Data Types**: Ensuring that each column has the correct data type (e.g., converting columns to categorical or numerical types as needed).

### Data Refinement
Data refinement involves further processing the cleaned dataset to prepare it for analysis. This step includes:
1. **Feature Engineering**: Creating new features or modifying existing ones to enhance the dataset's predictive power.
2. **Normalization and Scaling**: Normalizing or scaling numerical features to ensure they are on a similar scale, which can improve the performance of certain algorithms.
3. **Encoding Categorical Variables**: Converting categorical variables into numerical format using techniques like one-hot encoding or label encoding.

By following these steps, we ensure that the dataset is well-prepared for meaningful analysis and visualization, leading to more accurate and insightful results.

In [5]:

# Importing the required libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm


Importing the required libraries is essential and the foundation of any data analysis of a dataset.

In [6]:
# Load the dataset
data = pd.read_csv('gun_deaths.csv')

# Display the first 5 rows of the dataset
data.head()


Unnamed: 0,year,month,intent,police,sex,age,race,place,education
0,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,Home,BA+
1,2012,1,Suicide,0,F,21.0,White,Street,Some college
2,2012,1,Suicide,0,M,60.0,White,Other specified,BA+
3,2012,2,Suicide,0,M,64.0,White,Home,BA+
4,2012,2,Suicide,0,M,31.0,White,Other specified,HS/GED


```markdown
# Initial Exploration of the Dataset

The first five rows of the dataset provide a snapshot of the data structure and content. Here are some key observations:

1. **Columns**:
    - `year`: The year in which the incident occurred.
    - `month`: The month in which the incident occurred.
    - `intent`: The intent behind the incident (e.g., Suicide, Homicide).
    - `police`: Indicates whether a police officer was involved (0 for no, 1 for yes).
    - `sex`: The gender of the victim (M for male, F for female).
    - `age`: The age of the victim.
    - `race`: The race of the victim.
    - `place`: The location where the incident occurred.
    - `education`: The education level of the victim.

2. **Data Types**:
    - The `year`, `month`, and `police` columns are of integer type.
    - The `age` column is of float type, indicating that it may contain missing values.
    - The `intent`, `sex`, `race`, `place`, and `education` columns are of object type, representing categorical data.

3. **Missing Values**:
    - The `intent` column has 1 missing value.
    - The `age` column has 18 missing values.
    - The `place` column has 1,384 missing values.
    - The `education` column has 1,422 missing values.

4. **Duplicate Rows**:
    - There are 39,227 duplicate rows in the dataset, which need to be addressed during data cleaning.

5. **Distribution of Education Levels**:
    - The dataset contains various education levels, with `HS/GED` being the most common, followed by `Less than HS`, `Some college`, and `BA+`. There are also 1,422 missing values in the `education` column.

These initial findings highlight the need for data cleaning and refinement to handle missing values, remove duplicate rows, and ensure accurate data types for meaningful analysis.


In [7]:
# Get a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100798 entries, 0 to 100797
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   year       100798 non-null  int64  
 1   month      100798 non-null  int64  
 2   intent     100797 non-null  object 
 3   police     100798 non-null  int64  
 4   sex        100798 non-null  object 
 5   age        100780 non-null  float64
 6   race       100798 non-null  object 
 7   place      99414 non-null   object 
 8   education  99376 non-null   object 
dtypes: float64(1), int64(3), object(5)
memory usage: 6.9+ MB


```markdown
The `data.info()` method provides a concise summary of the DataFrame, which includes the number of entries (rows), the number of columns, and the data types of each column. It also shows the non-null count for each column, indicating how many non-missing values are present. This summary helps in understanding the structure of the dataset, identifying columns with missing values, and verifying the data types of each column. For instance, in our dataset, we have 100,798 entries and 9 columns, with some columns containing missing values (e.g., 'intent', 'age', 'place', and 'education'). The data types include integers, floats, and objects (categorical data).
```

In [8]:
# Get descriptive statistics for the dataset because the dataset contains numerical and categorical variables
data.describe()


Unnamed: 0,year,month,police,age
count,100798.0,100798.0,100798.0,100780.0
mean,2013.000357,6.567601,0.013909,43.857601
std,0.816278,3.405609,0.117114,19.496181
min,2012.0,1.0,0.0,0.0
25%,2012.0,4.0,0.0,27.0
50%,2013.0,7.0,0.0,42.0
75%,2014.0,9.0,0.0,58.0
max,2014.0,12.0,1.0,107.0


```markdown
The `df.describe()` method in pandas provides a summary of the central tendency, dispersion, and shape of a dataset's distribution, excluding `NaN` values. It generates descriptive statistics for numerical columns by default, including:

- **Count**: The number of non-null entries.
- **Mean**: The average value.
- **Std**: The standard deviation, which measures the spread of the data.
- **Min**: The minimum value.
- **25%**: The 25th percentile (first quartile).
- **50%**: The 50th percentile (median or second quartile).
- **75%**: The 75th percentile (third quartile).
- **Max**: The maximum value.

For example, calling `data.describe()` on our dataset will provide these statistics for columns like `year`, `month`, `age`, etc., helping us understand the distribution and variability of the data.
```

In [9]:
# Check for missing values as it may affect the analysis
missing_values = data.isnull().sum()
missing_values

year            0
month           0
intent          1
police          0
sex             0
age            18
race            0
place        1384
education    1422
dtype: int64

```markdown
The `data.isnull()` method in pandas is used to detect missing values in the DataFrame. It returns a DataFrame of the same shape as `data`, with boolean values indicating whether an element is missing (`True`) or not (`False`). By summing the result with `data.isnull().sum()`, we get the count of missing values for each column.

### Findings:
- `intent`: 1 missing value
- `age`: 18 missing values
- `place`: 1,384 missing values
- `education`: 1,422 missing values

These missing values need to be addressed during data cleaning to ensure accurate analysis.
```

In [10]:
# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows

39227

In [11]:
# Calculate the Z-scores for the 'age' column
z_scores = stats.zscore(data['age'])

# Convert the Z-scores to absolute values
abs_z_scores = np.abs(z_scores)

# Define a threshold for identifying outliers (e.g., Z-score > 3)
threshold = 3

# Identify the outliers
outliers = data[abs_z_scores > threshold]

# Display the outliers
outliers

Unnamed: 0,year,month,intent,police,sex,age,race,place,education


After running a Z test, the outliers detected are in the age section, their are three main ways to address this:

1. **Drop the outliers**
2. **Keep the outliers**
3. **Replace the values with a statistical method using the mean, mode, or median**

However, the outliers will be kept, due to the fact that they are real data. The outlier ages are actual ages of victims of dun-related deaths. Another reason is because removing them would askew the resuls, purposely removing real ages of victims due to falling outside the median can be percieved as a form of manipulation of the data.

## Begin Data Cleaning and Refinement


In [12]:
# Drop duplicate rows in the dataset
data = data.drop_duplicates()

# Check for duplicate rows in the dataset
duplicate_rows = data.duplicated().sum()
duplicate_rows


0

Now acting upon the findings found during the data exploration, the duplicated rows have been dropped. 

In [13]:
# Drop rows with missing values in 'intent' column
data = data.dropna(subset=['intent'])

# Drop rows with missing values in 'age' column
data = data.dropna(subset=['age'])

# Drop rows with missing values in 'place' column
data = data.dropna(subset=['place'])

# Drop rows with missing values in 'education' column
data = data.dropna(subset=['education'])



In [14]:
# Correcting data types
data['year'] = data['year'].astype('category')
data['month'] = data['month'].astype('category')
data['police'] = data['police'].astype('category')
data['sex'] = data['sex'].astype('category')
data['race'] = data['race'].astype('category')
data['place'] = data['place'].astype('category')
data['education'] = data['education'].astype('category')


In [15]:
# Verify the changes
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58927 entries, 0 to 100797
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   year       58927 non-null  category
 1   month      58927 non-null  category
 2   intent     58927 non-null  object  
 3   police     58927 non-null  category
 4   sex        58927 non-null  category
 5   age        58927 non-null  float64 
 6   race       58927 non-null  category
 7   place      58927 non-null  category
 8   education  58927 non-null  category
dtypes: category(7), float64(1), object(1)
memory usage: 1.7+ MB


In [16]:
# Check if data is cleaned
missing_values = data.isnull().sum()
print (missing_values)

year         0
month        0
intent       0
police       0
sex          0
age          0
race         0
place        0
education    0
dtype: int64


---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [17]:
import os
try:
  # Create a new folder in the current directory
  os.makedirs(os.path.join(current_dir, 'new_folder'))
except Exception as e:
  print(e)


[WinError 183] Cannot create a file when that file already exists: 'c:\\Users\\Compu\\Crime-Data\\new_folder'
