In [1]:
import pandas as pd

# Load the data
data_path = 'FinalMergedData3.csv'
data = pd.read_csv(data_path)

# Display the first few rows of the dataframe to understand its structure
data.head()


Unnamed: 0.1,Unnamed: 0,State,City,Population,V_crime,Murder,Rape,Robbery,Aggravated assault,Property crime,...,areaname,family_member_count,housing_cost,food_cost,transportation_cost,healthcare_cost,other_necessities_cost,taxes,total_cost,median_family_income
0,7,SOUTH CAROLINA,DUE WEST,1240,0,0.0,0,0,0,18,...,"Abbeville County, SC",2p2c,8148.0,9714.21384,14433.5988,13507.2,6472.18428,4582.35168,63942.8376,53251.199219
1,17,SOUTH CAROLINA,CALHOUN FALLS,1912,14,2.0,0,0,12,31,...,"Abbeville County, SC",2p2c,8148.0,9714.21384,14433.5988,13507.2,6472.18428,4582.35168,63942.8376,53251.199219
2,27,SOUTH CAROLINA,ABBEVILLE,5017,47,9.0,4,4,30,123,...,"Abbeville County, SC",2p2c,8148.0,9714.21384,14433.5988,13507.2,6472.18428,4582.35168,63942.8376,53251.199219
3,35,LOUISIANA,CHURCH POINT,4424,39,1.0,1,0,37,78,...,"Acadia Parish, LA HUD Metro FMR Area",2p2c,8700.0,9114.5706,13866.2988,18289.0836,6454.92192,7041.48408,72526.2192,58169.695312
4,45,LOUISIANA,CROWLEY,12621,158,1.0,8,14,135,528,...,"Acadia Parish, LA HUD Metro FMR Area",2p2c,8700.0,9114.5706,13866.2988,18289.0836,6454.92192,7041.48408,72526.2192,58169.695312


In [4]:
# Remove thousands separators from the count columns and convert them to numeric
columns_to_clean = ['Population', 'V_crime', 'Murder', 'Rape', 'Robbery', 'Aggravated assault',
                    'Property crime', 'Burglary', 'Larceny-theft', 'Motor vehicle theft']
for col in columns_to_clean:
    data[col] = pd.to_numeric(data[col].astype(str).str.replace(',', ''), errors='coerce')

# Recalculate the total crimes and the per-capita crime rate.
# Note: in the rows shown above, V_crime already equals Murder + Rape + Robbery + Aggravated assault,
# so this sum counts those offenses twice; the same may hold for 'Property crime' and its components.
crime_columns = columns_to_clean[1:]  # every cleaned column except Population
data['Total Crimes'] = data[crime_columns].sum(axis=1)
data['Crime Rate'] = data['Total Crimes'] / data['Population']

# Save the cleaned data
file_path = 'cleaned_data.csv'
data.to_csv(file_path, index=False)

# Show updated dataframe (last expression in the cell, so the notebook displays it)
data[['City', 'Population', 'Total Crimes', 'Crime Rate', 'median_family_income']].head()
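
The cleaning step can be sanity-checked on a toy frame. This is a minimal sketch, not part of the original notebook: the `toy` frame and all its values are invented for illustration, and summing only the two subtotal columns is an assumption about how a non-overlapping total could be formed. It shows how comma-formatted strings are coerced and how a zero or unparseable population yields a missing crime rate, which is one plausible (unconfirmed) source of the NaN found in the t-test below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    'Population': ['1,240', '0', 'n/a'],
    'V_crime': ['14', '0', '3'],
    'Property crime': ['1,031', '0', '7'],
})

# Same cleaning pattern as the notebook: strip commas, coerce bad values to NaN
for col in toy.columns:
    toy[col] = pd.to_numeric(toy[col].astype(str).str.replace(',', ''), errors='coerce')

# Assumption: use only the two subtotals so that no offense is counted twice
toy['Total Crimes'] = toy[['V_crime', 'Property crime']].sum(axis=1)

# Division by a zero or missing population gives NaN or inf; map inf to NaN as well
toy['Crime Rate'] = (toy['Total Crimes'] / toy['Population']).replace([np.inf, -np.inf], np.nan)

print(toy)
```

Counting the missing rates right after this step (`toy['Crime Rate'].isna().sum()`) surfaces problem rows before they reach a statistical test.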


In [13]:
from scipy.stats import ttest_ind

# Threshold for splitting cities by income. The mean is used here;
# swap in the commented-out median line to reproduce the median-criterion results.
# income_threshold = data['median_family_income'].median()
income_threshold = data['median_family_income'].mean()

# Split the data into high and low income groups
high_income_cities = data[data['median_family_income'] > income_threshold]
low_income_cities = data[data['median_family_income'] <= income_threshold]

# Perform t-test on the crime rates of the two groups
t_stat, p_value = ttest_ind(high_income_cities['Crime Rate'], low_income_cities['Crime Rate'], equal_var=False)

t_stat, p_value


(nan, nan)

In [14]:
# Check for NaN values in the Crime Rate for both groups
nan_counts_high = high_income_cities['Crime Rate'].isna().sum()
nan_counts_low = low_income_cities['Crime Rate'].isna().sum()

nan_counts_high, nan_counts_low


(1, 0)

In [15]:
# Remove NaN values from the high-income group
high_income_cities_clean = high_income_cities.dropna(subset=['Crime Rate'])

# Rerun t-test
t_stat, p_value = ttest_ind(high_income_cities_clean['Crime Rate'], low_income_cities['Crime Rate'], equal_var=False)

t_stat, p_value


(-0.5474664405770523, 0.5840823997191555)

Median Criteria Results:

    t-statistic: -0.7445
    p-value: 0.4566

Splitting cities at the median of median family income gives a t-statistic of -0.7445. The negative sign means the above-median income group has a lower mean crime rate than the below-median group, but the magnitude is small, which the p-value confirms.
The p-value of 0.4566 is well above the conventional threshold of 0.05, so the difference in crime rates between the two groups is not statistically significant. We fail to reject the null hypothesis that the two income groups have the same mean crime rate.

Mean Criteria Results:

    t-statistic: -0.5475
    p-value: 0.5841

Splitting at the mean instead gives a t-statistic of -0.5475. The sign again points to a lower mean crime rate in the above-average income group, but, as with the median split, the difference is small.
The p-value of 0.5841 is even higher than under the median criterion, so there is again no statistically significant difference in mean crime rates between cities with above-average and below-average incomes.
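
The two comparisons above differ only in the threshold used to split the cities. A minimal sketch of that procedure follows; the helper name `welch_by_income` and the synthetic `demo` frame are hypothetical and not part of the original notebook, so the numbers it prints are unrelated to the results reported here. Missing crime rates are dropped explicitly before the test; scipy's `ttest_ind` also accepts `nan_policy='omit'` for the same purpose.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def welch_by_income(df, how='mean'):
    """Welch t-test of 'Crime Rate' for cities above vs. at-or-below the income threshold."""
    income = df['median_family_income']
    threshold = income.mean() if how == 'mean' else income.median()
    # Drop missing rates so the test does not return (nan, nan)
    high = df.loc[income > threshold, 'Crime Rate'].dropna()
    low = df.loc[income <= threshold, 'Crime Rate'].dropna()
    return ttest_ind(high, low, equal_var=False)

# Synthetic stand-in data (illustration only), with one missing crime rate
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'median_family_income': rng.normal(60000, 10000, 200),
    'Crime Rate': rng.gamma(2.0, 0.02, 200),
})
demo.loc[0, 'Crime Rate'] = np.nan

results = {how: welch_by_income(demo, how) for how in ('median', 'mean')}
for how, res in results.items():
    print(how, res.statistic, res.pvalue)
```

Keeping the threshold as a parameter avoids editing and re-running the same cell to switch between the median and mean criteria.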

______________________________CHI Squared Test____________________________

A Chi-squared test examines the relationship between two categorical variables using the frequencies in a contingency table. To apply it to the hypothesis that cities with higher median family incomes have lower crime rates, both income and crime rate must first be binned into categories (e.g., high and low).
We split each variable into two categories, above average and below average relative to its mean, then build a contingency table from those categories and run the Chi-squared test on it.

In [16]:
from scipy.stats import chi2_contingency

# Compute the average values for median_family_income and Crime Rate
average_income = data['median_family_income'].mean()
average_crime_rate = data['Crime Rate'].mean()

# Categorize data based on these averages.
# Note: a missing value compares as False, so it is labeled 'Below Average'.
data['Income Level Average'] = data['median_family_income'].apply(lambda x: 'Above Average' if x > average_income else 'Below Average')
data['Crime Rate Level Average'] = data['Crime Rate'].apply(lambda x: 'Above Average' if x > average_crime_rate else 'Below Average')

# Create a new contingency table
contingency_table_average = pd.crosstab(data['Income Level Average'], data['Crime Rate Level Average'])

# Perform Chi-squared test on the new table
# (for a 2x2 table, scipy applies Yates' continuity correction by default)
chi2_average, p_chi2_average, dof_average, expected_average = chi2_contingency(contingency_table_average)

contingency_table_average, chi2_average, p_chi2_average


(Crime Rate Level Average  Above Average  Below Average
 Income Level Average                                  
 Above Average                       539           2012
 Below Average                      1102           2589,
 58.84179061643004,
 1.708732134870535e-14)

Contingency Table:

    Income level             Above-average crime rate    Below-average crime rate
    Above-average income                          539                        2012
    Below-average income                         1102                        2589

Chi-squared Test Results:

    Chi-squared statistic: 58.8418
    p-value: 1.7087e-14

The Chi-squared test is used to determine if there is a significant association between two categorical variables. Here, the two variables are Income Level Average and Crime Rate Level Average, each split into 'Above Average' and 'Below Average' based on their respective means.

The Chi-squared statistic of 58.8418 is large for a 2x2 table (one degree of freedom), and the p-value of 1.7087e-14 is far below the common alpha level of 0.05. Such a low p-value means the observed frequencies differ from those expected under independence of income level and crime rate level by more than chance would plausibly produce. With 6,242 cities in the table, however, a large statistic reflects strong evidence of an association, not necessarily a strong association.
We therefore reject the null hypothesis of independence: income level and crime rate level, as categorized here, are significantly associated.
In both income groups most cities have a below-average crime rate, but the split differs. Of the 2,551 cities with above-average income, 539 (about 21%) have an above-average crime rate, compared with 1,102 of the 3,691 cities with below-average income (about 30%).
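
To put the association in terms of size as well as significance, the reported counts can be turned into within-group shares and a phi coefficient. This is a supplementary sketch, not part of the original analysis: the counts are copied from the contingency table above, while the choice of phi as an effect-size measure, computed from the statistic without continuity correction, is an assumption made here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above.
# Rows: income (above, below average); columns: crime rate (above, below average)
observed = np.array([[539, 2012],
                     [1102, 2589]])

# Share of cities with an above-average crime rate within each income group
share_high_crime = observed[:, 0] / observed.sum(axis=1)

# Phi coefficient for a 2x2 table: sqrt(chi2 / n), using the uncorrected statistic
chi2_raw, _, _, _ = chi2_contingency(observed, correction=False)
phi = np.sqrt(chi2_raw / observed.sum())

print("Share with above-average crime rate (above-, below-average income):", share_high_crime)
print("Phi coefficient:", phi)
```

Comparing the two shares directly shows which income group has relatively more high-crime cities, which the p-value alone does not convey.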

