With its default NumPy-backed dtypes, Pandas cannot store NaN values in integer columns. (The nullable extension dtypes such as Int64 are the exception, but they have to be requested explicitly.)

This makes float the obvious default choice for storing numeric data: as soon as a missing value appears, Pandas has to change the data type of the entire column, and missing values are very common in practice.

The restriction is inherited from NumPy. Pandas needs to set aside a particular bit pattern to represent NaN. For floating-point numbers this is straightforward, because the IEEE 754 standard already defines one. For a fixed-width integer it is more awkward and less efficient, since every bit pattern already encodes a valid value.
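
A minimal sketch of the behaviour described above, on a made-up Series rather than the incidents data. The variable names are illustrative only. It shows the upcast to float64 when a missing value appears, and the opt-in nullable Int64 dtype as an alternative.

```python
import pandas as pd

# An integer Series with no missing values keeps its integer dtype.
s = pd.Series([1, 2, 3], dtype="int64")
print(s.dtype)

# Reindexing adds a label with no value, so the NumPy-backed
# column is upcast to float64 in order to hold NaN.
s_missing = s.reindex([0, 1, 2, 3])
print(s_missing.dtype)

# The nullable extension dtype "Int64" (capital I) keeps integers
# and represents the gap with pd.NA instead of NaN.
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)
print(s_nullable.isna().sum())
```

Whether Int64 is worth using here depends on how the columns are consumed later; the rest of this notebook stays with the float default.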

TODOs:
- X ---- Join the tables
- The min_age, avg_age, max_age columns contain ints with out-of-range values; remove them
- Columns 15, 16, 17 have a type problem; investigate
- Use unique to find which columns have data to clean up (e.g. inc_char2: Suicide^)
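
One possible way to approach the age and unique items in the list above, sketched on a small hand-made frame that only mimics the column names of the incidents table. The bounds MIN_AGE and MAX_AGE are assumptions, not values from the dataset. Masking implausible ages as NaN, rather than dropping rows, is just one reading of "remove".

```python
import numpy as np
import pandas as pd

# Small hand-made sample standing in for the joined incidents table.
toy = pd.DataFrame({
    "min_age_participants": [19.0, 62.0, 248339.0, np.nan],
    "avg_age_participants": [19.0, 62.0, 707477.0, np.nan],
    "max_age_participants": [19.0, 62.0, 761203.0, np.nan],
    "incident_characteristics2": [
        None, "Suicide^", None, "Drive-by (car to street, car to car)",
    ],
})

# Assumed plausibility bounds for a human age (not taken from the notebook).
MIN_AGE, MAX_AGE = 0, 120
age_cols = ["min_age_participants", "avg_age_participants", "max_age_participants"]

# Replace implausible ages with NaN while leaving the rest of the row intact.
for col in age_cols:
    out_of_range = toy[col].notna() & ~toy[col].between(MIN_AGE, MAX_AGE)
    toy.loc[out_of_range, col] = np.nan

# List the distinct values of a text column to spot entries needing cleanup.
print(toy["incident_characteristics2"].dropna().unique())
```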

In [6]:
%matplotlib inline
import copy
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

from collections import defaultdict
from scipy.stats import pearsonr

from sklearn.preprocessing import LabelEncoder

# To show all columns
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Add ID column to use as generic index
df = pd.read_csv('incidents.csv', header=0)
df['ID'] = range(1, len(df) + 1)
df.set_index('ID', inplace=True)

# Pretty-print
# display(df.head())

df_poverty = pd.read_csv('povertyByStateYear.csv')
df_district = pd.read_csv('year_state_district_house.csv')

# Add and reorder year column to join poverty table
columns = df.columns
df['year'] = pd.to_datetime(df['date']).dt.year
df = df[columns.insert(columns.get_loc('date') + 1, 'year')]
j_df = pd.merge(df, df_poverty, on=['year','state'], how='left')

# Workaround to join district table
temp = j_df['state']
j_df['state'] = j_df['state'].str.upper()
j2_df = pd.merge(j_df, df_district, on=['year','state','congressional_district'], how='left')
j2_df['state'] = temp

# runs .strip on all object cells in a df
j2_df = j2_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

j2_df['date'] = pd.to_datetime(j2_df['date'], errors='coerce')

# Define the list of columns to convert to numeric
columns_to_numeric = [  'min_age_participants', 'avg_age_participants', 'max_age_participants',
                        'n_participants_child', 'n_participants_teen', 'n_participants_adult',
                        'participant_age1', 'n_males', 'n_females', 'n_killed', 'n_injured',
                        'n_arrested', 'n_unharmed', 'n_participants', 'povertyPercentage',
                        'candidatevotes', 'totalvotes', 'congressional_district',
                        'state_house_district', 'state_senate_district', 'year', 'latitude', 'longitude'] # and the 3 district columns?

for colonna in columns_to_numeric:
    nan_before = j2_df[colonna].isna().sum()
    j2_df[colonna] = pd.to_numeric(j2_df[colonna], errors='coerce')
    nan_after = j2_df[colonna].isna().sum()
    print(f"{colonna} : {nan_after - nan_before}")

# Keep a copy with the original string values before label encoding
df_with_string = j2_df.copy()

print("Correlation Matrix")
# display(j2_df.corr(numeric_only=True))
# print()
le = LabelEncoder()
columns_not_numeric = [ 'state', 'city_or_county', 'address', 'participant_age_group1',
                        'participant_gender1', 'notes', 'incident_characteristics1',
                        'incident_characteristics2', 'party']
for column in columns_not_numeric:
    j2_df[column] = le.fit_transform(j2_df[column])
# display(j2_df.corr(numeric_only=True))

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop


def get_top_abs_correlations(df, n=5):
    '''Return the n largest absolute pairwise correlations, excluding self and mirrored pairs'''
    au_corr = df.corr(numeric_only=True).abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
j2_df2 = j2_df[j2_df.columns.difference(['date'])]
print("-------------------")
print(get_top_abs_correlations(j2_df2, 30))
print("-------------------")

display(j2_df.head())
display(df_with_string.head())
j2_df.info()




min_age_participants : 5753
avg_age_participants : 5889
max_age_participants : 5885
n_participants_child : 5
n_participants_teen : 7
n_participants_adult : 3
participant_age1 : 0
n_males : 0
n_females : 0
n_killed : 0
n_injured : 0
n_arrested : 0
n_unharmed : 0
n_participants : 0
povertyPercentage : 0
candidatevotes : 0
totalvotes : 0
congressional_district : 0
state_house_district : 0
state_senate_district : 0
year : 0
latitude : 0
longitude : 0
Correlation Matrix
Top Absolute Correlations
-------------------
n_participants_child       n_participants_teen       0.999999
n_participants_adult       n_participants_child      0.998271
avg_age_participants       participant_age1          0.944480
max_age_participants       participant_age1          0.926224
min_age_participants       participant_age1          0.874960
candidatevotes             totalvotes                0.846980
n_males                    n_participants            0.823163
n_participants_adult       n_participants_teen    

Unnamed: 0,date,year,state,city_or_county,address,latitude,longitude,congressional_district,state_house_district,state_senate_district,participant_age1,participant_age_group1,participant_gender1,min_age_participants,avg_age_participants,max_age_participants,n_participants_child,n_participants_teen,n_participants_adult,n_males,n_females,n_killed,n_injured,n_arrested,n_unharmed,n_participants,notes,incident_characteristics1,incident_characteristics2,povertyPercentage,party,candidatevotes,totalvotes
0,2015-05-02,2015,14,5425,166064,39.8322,-86.2492,7.0,94.0,33.0,19.0,0,1,19.0,19.0,19.0,,,,1.0,0.0,0,1,0.0,0.0,1.0,76617,41,90,12.3,3,,
1,2017-04-03,2017,38,5685,107657,41.6645,-78.7856,5.0,,,62.0,0,1,62.0,62.0,62.0,,,,1.0,0.0,1,0,0.0,0.0,1.0,124178,40,84,10.5,3,,
2,2016-11-05,2016,22,3022,114611,42.419,-83.0393,14.0,4.0,2.0,,3,3,,,,,,,,,0,1,0.0,1.0,2.0,812,41,90,11.0,0,244135.0,310974.0
3,2016-10-15,2016,8,12081,5108,38.903,-76.982,1.0,,,,0,1,248339.0,707477.0,761203.0,,,,1.0,0.0,0,1,0.0,0.0,2.0,136435,41,90,14.9,3,,
4,2030-06-14,2030,38,9082,145665,40.4621,-80.0308,14.0,,,,0,1,,,,,,,1.0,0.0,0,1,0.0,1.0,2.0,136435,41,27,,3,,


Unnamed: 0,date,year,state,city_or_county,address,latitude,longitude,congressional_district,state_house_district,state_senate_district,participant_age1,participant_age_group1,participant_gender1,min_age_participants,avg_age_participants,max_age_participants,n_participants_child,n_participants_teen,n_participants_adult,n_males,n_females,n_killed,n_injured,n_arrested,n_unharmed,n_participants,notes,incident_characteristics1,incident_characteristics2,povertyPercentage,party,candidatevotes,totalvotes
0,2015-05-02,2015,Indiana,Indianapolis,Lafayette Road and Pike Plaza,39.8322,-86.2492,7.0,94.0,33.0,19.0,Adult 18+,Male,19.0,19.0,19.0,,,,1.0,0.0,0,1,0.0,0.0,1.0,Teen wounded while walking - Security guard at...,Shot - Wounded/Injured,,12.3,,,
1,2017-04-03,2017,Pennsylvania,Kane,5647 US 6,41.6645,-78.7856,5.0,,,62.0,Adult 18+,Male,62.0,62.0,62.0,,,,1.0,0.0,1,0,0.0,0.0,1.0,shot self after accident,"Shot - Dead (murder, accidental, suicide)",Suicide^,10.5,,,
2,2016-11-05,2016,Michigan,Detroit,6200 Block of East McNichols Road,42.419,-83.0393,14.0,4.0,2.0,,,,,,,,,,,,0,1,0.0,1.0,2.0,1 inj.,Shot - Wounded/Injured,,11.0,DEMOCRAT,244135.0,310974.0
3,2016-10-15,2016,District of Columbia,Washington,"1000 block of Bladensburg Road, NE",38.903,-76.982,1.0,,,,Adult 18+,Male,248339.0,707477.0,761203.0,,,,1.0,0.0,0,1,0.0,0.0,2.0,,Shot - Wounded/Injured,,14.9,,,
4,2030-06-14,2030,Pennsylvania,Pittsburgh,California and Marshall Avenues,40.4621,-80.0308,14.0,,,,Adult 18+,Male,,,,,,,1.0,0.0,0,1,0.0,1.0,2.0,,Shot - Wounded/Injured,"Drive-by (car to street, car to car)",,,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 33 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   date                       239677 non-null  datetime64[ns]
 1   year                       239677 non-null  int32         
 2   state                      239677 non-null  int64         
 3   city_or_county             239677 non-null  int64         
 4   address                    239677 non-null  int64         
 5   latitude                   231754 non-null  float64       
 6   longitude                  231754 non-null  float64       
 7   congressional_district     227733 non-null  float64       
 8   state_house_district       200905 non-null  float64       
 9   state_senate_district      207342 non-null  float64       
 10  participant_age1           147379 non-null  float64       
 11  participant_age_group1     239677 non-null  int64   

In [7]:
incidents_duplicated_rows = j2_df.duplicated()
print("The total number of duplicate rows in the Incidents dataset is", incidents_duplicated_rows.sum())
# incidents[incidents_duplicated_rows]
j2_df = j2_df.drop_duplicates()

The total number of duplicate rows in the Incidents dataset is 254
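
Before dropping duplicates it can help to look at the full duplicate groups. The sketch below uses a made-up frame (the column names only echo the incidents table). It contrasts the default duplicated(), which flags repeats after the first occurrence, with keep=False, which flags every member of a group.

```python
import pandas as pd

# Made-up sample: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "state": ["Indiana", "Indiana", "Michigan", "Indiana"],
    "n_killed": [0, 0, 1, 0],
})

# Default: only the repeats after the first occurrence are flagged.
n_repeats = toy.duplicated().sum()
print(n_repeats)

# keep=False flags all members of each duplicate group, which is
# handy for inspecting them side by side.
dup_groups = toy[toy.duplicated(keep=False)]
print(dup_groups)

# drop_duplicates() keeps the first occurrence of each group by default.
deduped = toy.drop_duplicates()
print(len(deduped))
```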
