

The *`Communities and Crime`* dataset that I chose for my project is available from the UCI Machine Learning Repository. <br> You can fetch the data using the code provided on the dataset's [website](https://archive.ics.uci.edu/dataset/183/communities+and+crime):

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold

In [2]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
communities_and_crime = fetch_ucirepo(id=183)

In [4]:
# data (as pandas dataframes); take a copy so the edits below do not act on a slice of another frame
X = communities_and_crime.data.features.copy()
y = communities_and_crime.data.targets  # target variable = total number of violent crimes per 100K population (numeric - decimal)

X = X.replace('?', np.nan)  # substitute "?" with NaN, since "?" marks missing values
X = X.rename(columns={'county': 'country'})  # correcting a spelling error in a variable name
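
pandas may emit a `SettingWithCopyWarning` when a frame derived from another one is modified with `inplace=True`. Below is a minimal sketch of a pattern that avoids it: take an explicit copy, then reassign the result of `replace` and `rename`. The `raw` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame; "?" marks a missing value, as in the UCI files
raw = pd.DataFrame({"state": [8, 53, 34], "county": ["?", "?", "5"]})

# Explicit copy: pandas now treats `features` as an independent object
features = raw[["state", "county"]].copy()

# Reassign the results instead of using inplace=True
features = features.replace("?", np.nan)
features = features.rename(columns={"county": "country"})

print(features.isna().sum())
```

The original `raw` frame is left untouched, so the cleaning step can be re-run without fetching the data again.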


In [5]:
print(X)

      state country community        communityname  fold  population  \
0         8     NaN       NaN         Lakewoodcity     1        0.19   
1        53     NaN       NaN          Tukwilacity     1        0.00   
2        24     NaN       NaN         Aberdeentown     1        0.00   
3        34       5     81440  Willingborotownship     1        0.04   
4        42      95      6096    Bethlehemtownship     1        0.01   
...     ...     ...       ...                  ...   ...         ...   
1989     12     NaN       NaN    TempleTerracecity    10        0.01   
1990      6     NaN       NaN          Seasidecity    10        0.05   
1991      9       9     80070        Waterburytown    10        0.16   
1992     25      17     72600          Walthamcity    10        0.08   
1993      6     NaN       NaN          Ontariocity    10        0.20   

      householdsize  racepctblack  racePctWhite  racePctAsian  ...  \
0              0.33          0.02          0.90          0.12  ..

The dataset comprises 1994 observations and 127 variables, detailed on the dataset's [website](https://archive.ics.uci.edu/dataset/183/communities+and+crime).<br>
Because of time and memory constraints, only 16 variables (plus the target) were selected, using the following method:

-----------------------------------------
**Select the variables that are more relevant to the target**

1) Check for missing values (NaN) and delete the variables that contain them (most of these are about 84% missing)

In [6]:
# Keep 'state' (column 0) and drop columns 1-4 (country, community, communityname, fold)
X = X.iloc[:, [0] + list(range(5, X.shape[1]))]
# Compute the percentage of missing values in each column
na_counts = X.isna().sum() / len(X) * 100

# Filter and print only the columns that have missing values
na_counts = na_counts[na_counts > 0]
print(na_counts)

OtherPerCap              0.050150
LemasSwornFT            84.002006
LemasSwFTPerPop         84.002006
LemasSwFTFieldOps       84.002006
LemasSwFTFieldPerPop    84.002006
LemasTotalReq           84.002006
LemasTotReqPerPop       84.002006
PolicReqPerOffic        84.002006
PolicPerPop             84.002006
RacialMatchCommPol      84.002006
PctPolicWhite           84.002006
PctPolicBlack           84.002006
PctPolicHisp            84.002006
PctPolicAsian           84.002006
PctPolicMinor           84.002006
OfficAssgnDrugUnits     84.002006
NumKindsDrugsSeiz       84.002006
PolicAveOTWorked        84.002006
PolicCars               84.002006
PolicOperBudg           84.002006
LemasPctPolicOnPatr     84.002006
LemasGangUnitDeploy     84.002006
PolicBudgPerPop         84.002006
dtype: float64


In [7]:
# Delete every variable with missing values before reducing the data
# (the police-related ones are ~84% missing; OtherPerCap has a single missing value)
columns_to_remove = [
    'OtherPerCap', 'LemasSwornFT', 'LemasSwFTPerPop', 'LemasSwFTFieldOps', 'LemasSwFTFieldPerPop',
    'LemasTotalReq', 'LemasTotReqPerPop', 'PolicReqPerOffic', 'PolicPerPop',
    'RacialMatchCommPol', 'PctPolicWhite', 'PctPolicBlack', 'PctPolicHisp',
    'PctPolicAsian', 'PctPolicMinor', 'OfficAssgnDrugUnits', 'NumKindsDrugsSeiz',
    'PolicAveOTWorked', 'PolicCars', 'PolicOperBudg', 'LemasPctPolicOnPatr',
    'LemasGangUnitDeploy', 'PolicBudgPerPop'
]
X = X.drop(columns=columns_to_remove)
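
As an alternative to typing the column names by hand, the list can be derived from the missing-value percentages. This is only a sketch on a made-up frame: `toy`, its values, and the `max_na_pct` cutoff are assumptions for illustration, with the cutoff set to 0 so that it reproduces the rule used above (drop any column that has a missing value).

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for X
toy = pd.DataFrame({
    "population": [0.19, 0.00, 0.04, 0.01],
    "PolicCars": [np.nan, np.nan, np.nan, 0.3],   # 75% missing
    "OtherPerCap": [0.2, np.nan, 0.3, 0.4],       # 25% missing
})

max_na_pct = 0.0  # hypothetical cutoff: drop anything with more missing values than this

# isna().mean() gives the fraction of missing values per column
na_pct = toy.isna().mean() * 100
cols_with_na = na_pct[na_pct > max_na_pct].index.tolist()

toy_clean = toy.drop(columns=cols_with_na)
print(toy_clean.columns.tolist())
```

Raising `max_na_pct` would keep columns that are only occasionally missing, at the cost of having to impute or drop the affected rows later.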

2)  Delete the variables with *low variance* and the *highly correlated* ones

In [8]:
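# Remove features with low variance: keep only those with variance above 0.01 * (1 - 0.01) = 0.0099
selector = VarianceThreshold(threshold=(.01 * (1 - .01)))
X_high_variance = selector.fit_transform(X)
# Rebuild a dataframe with the names of the columns that passed the filter
df = pd.DataFrame(X_high_variance, columns=X.columns[selector.get_support()])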

# Calculate the correlation matrix
corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features with correlation greater than 0.95
df_reduced = df.drop(columns=to_drop)
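
To make the two filters easier to follow, here is a small sketch on made-up data. It replaces `VarianceThreshold` with a plain pandas comparison on the population variance (`ddof=0`), which to my understanding is the quantity the scikit-learn selector compares against its threshold. The correlation step is the same as in the cell above. The frame `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'x_doubled' is perfectly correlated with 'x',
# 'almost_constant' has a variance far below the threshold
toy = pd.DataFrame({
    "x": [0.1, 0.4, 0.2, 0.9, 0.5],
    "x_doubled": [0.2, 0.8, 0.4, 1.8, 1.0],
    "z": [0.9, 0.1, 0.5, 0.3, 0.7],
    "almost_constant": [0.5, 0.5, 0.5, 0.5, 0.51],
})

# Step 1: variance filter (population variance, same threshold as above)
threshold = 0.01 * (1 - 0.01)
variances = toy.var(ddof=0)
kept = toy.loc[:, variances > threshold]

# Step 2: correlation filter on the upper triangle, so each pair is checked once
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = kept.drop(columns=to_drop)

print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is the one dropped, so the result depends on column order.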

In [9]:
print(df_reduced)

      state  population  householdsize  racepctblack  racePctWhite  \
0       8.0        0.19           0.33          0.02          0.90   
1      53.0        0.00           0.16          0.12          0.74   
2      24.0        0.00           0.42          0.49          0.56   
3      34.0        0.04           0.77          1.00          0.08   
4      42.0        0.01           0.55          0.02          0.95   
...     ...         ...            ...           ...           ...   
1989   12.0        0.01           0.40          0.10          0.87   
1990    6.0        0.05           0.96          0.46          0.28   
1991    9.0        0.16           0.37          0.25          0.69   
1992   25.0        0.08           0.51          0.06          0.87   
1993    6.0        0.20           0.78          0.14          0.46   

      racePctAsian  racePctHisp  agePct12t21  agePct12t29  agePct16t24  ...  \
0             0.12         0.17         0.34         0.47         0.29  ...   
1

After removing the low-variance variables and one variable from each highly correlated pair, ***82*** of the 127 original variables remain, as seen in the results.

3) Select the variables based on the objective of the project

In [10]:
# For an analysis from a sociological and socio-economic perspective, the reduced dataframe will be considered, excluding columns related to police characteristics and those tagged as non-predictive.
# These columns serve as identifiers or methodological tools rather than contributing to the predictive modeling of crime rates.
# This approach will focus on the remaining 16 variables:

columns_to_keep = ['PctYoungKids2Par', 'PctTeen2Par', 'PctEmploy', 'PctPopUnderPov', 'PctBSorMore',
                      'pctWInvInc', 'PctSpeakEnglOnly', 'NumIlleg', 'PctLargHouseFam',
                      'PctNotSpeakEnglWell', 'PctFam2Par', 'PctWorkMom', 'medIncome',
                      'PctUnemployed', 'pctWPubAsst', 'state']

data = X[columns_to_keep]
target = y

rename_dict = {
    'PctYoungKids2Par': 'YoungKids_2Par',
    'PctTeen2Par': 'Teen_2Par',
    'PctEmploy': 'Employed',
    'PctPopUnderPov': 'Below_Poverty',
    'PctBSorMore': 'Degree_BS_Or_More',
    'PctWorkMom': 'Working_mom',
    'PctSpeakEnglOnly': 'Speak_Eng_Only',
    'NumIlleg': 'Illegitimate_Births',
    'PctLargHouseFam': 'Large_Families',
    'PctNotSpeakEnglWell': 'Poor_English',
    'PctFam2Par': 'Families_2Parents',
    'pctWInvInc': 'Inc_from_inv',
    'medIncome': 'Median_Income',
    'PctUnemployed': 'Unemployment',
    'pctWPubAsst': 'Welfare_Public_Assist',
    'state': 'State'
}

data = data.rename(columns=rename_dict)
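
Selecting columns by name raises a `KeyError` if one of them has been removed by an earlier step. A minimal sketch of a check that can be run before selecting and renaming is shown below; the frame `frame`, the list `wanted`, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with only some of the wanted columns
frame = pd.DataFrame({
    "state": [8, 53],
    "medIncome": [0.37, 0.31],
    "PctEmploy": [0.68, 0.73],
})
wanted = ["PctEmploy", "medIncome", "state", "PctUnemployed"]

# Split the wanted names into those that exist and those that do not
missing = [c for c in wanted if c not in frame.columns]
present = [c for c in wanted if c in frame.columns]
print("Missing columns:", missing)

# Select and rename only the columns that are actually available
subset = frame[present].rename(columns={
    "PctEmploy": "Employed",
    "medIncome": "Median_Income",
    "state": "State",
})
print(subset.columns.tolist())
```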

In [11]:
print(data)

      YoungKids_2Par  Teen_2Par  Employed  Below_Poverty  Degree_BS_Or_More  \
0               0.61       0.56      0.68           0.19               0.48   
1               0.60       0.39      0.73           0.24               0.30   
2               0.43       0.43      0.58           0.27               0.19   
3               0.83       0.65      0.71           0.10               0.31   
4               0.89       0.85      0.65           0.06               0.33   
...              ...        ...       ...            ...                ...   
1989            0.67       0.59      0.71           0.16               0.65   
1990            0.69       0.70      0.77           0.32               0.22   
1991            0.47       0.47      0.46           0.31               0.21   
1992            0.75       0.71      0.57           0.16               0.42   
1993            0.64       0.60      0.58           0.35               0.16   

      Inc_from_inv  Speak_Eng_Only  Illegitimate_Bi

In [12]:
print(data.isna().sum())
# No missing values, so nothing needs to be adjusted

YoungKids_2Par           0
Teen_2Par                0
Employed                 0
Below_Poverty            0
Degree_BS_Or_More        0
Inc_from_inv             0
Speak_Eng_Only           0
Illegitimate_Births      0
Large_Families           0
Poor_English             0
Families_2Parents        0
Working_mom              0
Median_Income            0
Unemployment             0
Welfare_Public_Assist    0
State                    0
dtype: int64


Lastly, the dataframe was saved as a *`.csv`* file to be imported into RStudio, where the analysis was continued.  

In [13]:
# Combine "data" and the target variable in a single dataframe
df_crime = data.copy()
df_crime['target'] = target


# Save the DataFrame to a CSV file so that we can download it and import it into the R environment
df_crime.to_csv('crime_data.csv', index=False)
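
Before moving to R, it can be worth confirming that the file reads back with the expected shape. The sketch below writes to an in-memory buffer instead of a file. The frames and values are made up, and the target column name `ViolentCrimesPerPop` is an assumption about how the targets frame is labelled.

```python
import io
import pandas as pd

# Hypothetical stand-ins for `data` and `target`
features = pd.DataFrame({"Employed": [0.68, 0.73], "State": [8, 53]})
target = pd.DataFrame({"ViolentCrimesPerPop": [0.20, 0.67]})  # assumed column name

out = features.copy()
# Select the single column explicitly so a Series, not a DataFrame, is assigned
out["target"] = target["ViolentCrimesPerPop"]

# Write to an in-memory buffer and read it back
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)
print(restored.columns.tolist())
```

With `index=False` the row index is not written, so the file contains only the feature columns and the target.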

-----------------------------------------