<h1 style="text-align:center;">Mass Shootings in the USA</h1>

<p style="text-align:center;">Laia Mogas i Pau Mateo</p>
<p style="text-align:center;">Information Visualization - GCED</p>

## Requirements

In [28]:
# >pip install pandas
# >pip install altair
# >pip install vega_datasets
# (json, collections and random are part of the Python standard library)

import pandas as pd
import altair as alt
import json
from collections import defaultdict
from random import choice

## Obtaining data
The main data sources we used in this project are:
- https://www2.census.gov/
- https://www.gunviolencearchive.org/reports

From these sources we obtain the following datasets:
- Mass shootings (January 2014 to September 2024)
- School incidents (January 2022 to 2024)
- County population


## Preprocessing and data cleaning - County Population
We obtained the county population data as two separate CSV files: one corresponding to the years 2014-2019 and another for 2020 and later. These datasets contained a lot of information, but we only needed the yearly population estimate for each county, so we dropped the unnecessary columns and merged the two datasets with pandas. We then noticed that Connecticut redistributed its counties in 2020, so we decided to keep the latest configuration of counties and estimate their population for the years before 2020 by splitting the state total uniformly among them. The code we used for this preprocessing is the following:

In [29]:
pop2010_20 = pd.read_csv('datasets/co-est2010-2020_alldata.csv', encoding='latin1')
pop2020_23 = pd.read_csv('datasets/co-est2020-2023_alldata.csv', encoding='latin1')

########## column projection ##########
columns2010 = ['STATE', 'COUNTY','STNAME','CTYNAME']
columns2010.extend(['POPESTIMATE201'+str(i) for i in range(10)])
columns2020 = ['STATE', 'COUNTY','STNAME','CTYNAME']
columns2020.extend(['POPESTIMATE202'+str(i) for i in range(4)])

pop2010_20['FIPS'] = pop2010_20.apply(lambda row: f"{int(row['STATE']):02d}{int(row['COUNTY']):03d}", axis=1)
pop2020_23['FIPS'] = pop2020_23.apply(lambda row: f"{int(row['STATE']):02d}{int(row['COUNTY']):03d}", axis=1)


##########     merge     ##########
df_population = pd.merge(pop2010_20[columns2010], pop2020_23[columns2020], on=['STATE', 'COUNTY'])

df_population['STATE'] = df_population['STATE'].astype(str)
df_population['COUNTY'] = df_population['COUNTY'].astype(str)


########## obtaining FIPS ##########
df_population['FIPS'] = df_population.apply(lambda row: f"{int(row['STATE']):02d}{int(row['COUNTY']):03d}", axis=1)
df_population['STNAME'] = df_population['STNAME_x']
df_population['CTYNAME'] = df_population['CTYNAME_x']

########## ordering columns ##########
new_cols_order = ['STNAME', 'CTYNAME', 'FIPS', 'STATE', 'COUNTY'] + ['POPESTIMATE20'+f'{i:02d}' for i in range(14,24)]
df_population = df_population[new_cols_order]
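
The FIPS code built above is the two-digit state code followed by the three-digit county code, both zero-padded. A minimal sketch of that rule as a standalone helper (`make_fips` is a hypothetical name, not part of the notebook):

```python
def make_fips(state: int, county: int) -> str:
    """Build a 5-character county FIPS code from numeric state and county codes."""
    return f"{int(state):02d}{int(county):03d}"


# Connecticut is state 9; Alabama is state 1
print(make_fips(9, 110))
print(make_fips(1, 1))
```

Keeping the code as a string preserves the leading zero, which is lost whenever the column is cast to an integer.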

In [30]:
##### checking changed counties #####

l1 = []
for c in pop2010_20['FIPS']:
    if c not in df_population['FIPS'].values:
        l1.append(c)

l2 = []
for c in pop2020_23['FIPS']:
    if c not in df_population['FIPS'].values:
        l2.append(c)

print(l1)
print(l2)

['09001', '09003', '09005', '09007', '09009', '09011', '09013', '09015']
['09110', '09120', '09130', '09140', '09150', '09160', '09170', '09180', '09190']


We see that the Connecticut counties are indeed different in the two datasets.
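
The same check can be phrased as set differences, which avoids the repeated membership test inside a loop. A sketch with a few toy codes (the two sets below are illustrative subsets, not the full datasets):

```python
# Toy subsets of the FIPS codes found in each file
old_fips = {"01001", "09001", "09003"}
new_fips = {"01001", "09110", "09120"}

only_in_old = sorted(old_fips - new_fips)  # codes that disappeared after 2020
only_in_new = sorted(new_fips - old_fips)  # codes that were introduced in 2020

print(only_in_old)
print(only_in_new)
```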

In [31]:
##### dealing with state '09' (Connecticut) ######

fips_connecticut = pop2020_23[pop2020_23['STATE']==9]['FIPS'].values
N = len(fips_connecticut)

totalpop = defaultdict()


for i in range(14,24):
    year = 'POPESTIMATE20'+ f'{i:02d}'
    #calculate total population in state '09' in this year
    if i < 20:
        totalpop[year] = sum(pop2010_20[pop2010_20['STATE'] == 9][year].values)
    else:
        totalpop[year] = sum(pop2020_23[pop2020_23['STATE'] == 9][year].values)


for fips in fips_connecticut:
    new_row = pd.Series({
        'STNAME': 'Connecticut', 
        'CTYNAME': 'connecticut_county_unknown',
        'FIPS': fips, 
        'STATE': 9, 
        'COUNTY': fips[2:],
        'POPESTIMATE2014': totalpop['POPESTIMATE2014'] / N,
        'POPESTIMATE2015': totalpop['POPESTIMATE2015'] / N,
        'POPESTIMATE2016': totalpop['POPESTIMATE2016'] / N,
        'POPESTIMATE2017': totalpop['POPESTIMATE2017'] / N,
        'POPESTIMATE2018': totalpop['POPESTIMATE2018'] / N,
        'POPESTIMATE2019': totalpop['POPESTIMATE2019'] / N,
        'POPESTIMATE2020': totalpop['POPESTIMATE2020'] / N,
        'POPESTIMATE2021': totalpop['POPESTIMATE2021'] / N,
        'POPESTIMATE2022': totalpop['POPESTIMATE2022'] / N,
        'POPESTIMATE2023': totalpop['POPESTIMATE2023'] / N
    })
    df_population.loc[len(df_population)] = new_row


In [32]:
#### save partial dataset ###
df_population.to_csv('datasets/CountyPopulationAllYears.csv', index=False)

Later on, we will have to redistribute the mass shootings and school incidents that occurred in the counties we have just eliminated.
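
One way to redistribute those incidents is to assign each one to a randomly chosen county of the new configuration. Random assignment changes between runs unless the generator is seeded. A small sketch, assuming reproducibility is wanted (`assign_random_fips` and the seed value are hypothetical, not part of the notebook):

```python
import random

connecticut_fips = ['09110', '09120', '09130', '09140', '09150',
                    '09160', '09170', '09180', '09190']


def assign_random_fips(n_incidents: int, seed: int = 0) -> list[str]:
    """Pick one Connecticut FIPS code per incident, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return [rng.choice(connecticut_fips) for _ in range(n_incidents)]


first = assign_random_fips(5)
second = assign_random_fips(5)
print(first == second)  # same seed, same assignment
```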

## Preprocessing and data cleaning - Mass Shootings + School incidents

To clean the Gun Violence data we used Open Refine followed by Python.
#### Concatenating mass shootings csv
The raw data from the Gun Violence Archive is split into several datasets. To combine everything into a single CSV file, we opened all the CSV files together when starting a new Open Refine project. This way the files are concatenated automatically, as they all have the same columns. Before doing that, we had to eliminate some duplicate rows, as the file `MassShootings_2021.csv` overlaps with the file `GunVioleceAllYears.csv`. Again we used Open Refine: we made a time facet on `MassShootings_2021.csv`, selected the rows that were already in `GunVioleceAllYears.csv`, and removed them.

#### Transformations
We applied some basic transformations to the dataset, mainly changing the types of some columns: *Incident Date* to date, and the numerical columns to integer.
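
These conversions were done in Open Refine. For reference, a rough pandas equivalent on toy rows (the column name `Victims Killed` and all values are hypothetical and used only for illustration):

```python
import pandas as pd

# Toy rows as they might look right after loading: everything is text
raw = pd.DataFrame({
    "Incident Date": ["2014-01-01", "2024-09-15"],
    "Victims Killed": ["4", "0"],
})

typed = raw.copy()
typed["Incident Date"] = pd.to_datetime(typed["Incident Date"])             # text -> datetime
typed["Victims Killed"] = pd.to_numeric(typed["Victims Killed"]).astype(int)  # text -> integer

print(typed.dtypes)
```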

#### Obtaining counties with OSM
To answer the project's questions accurately, we needed the exact county where each incident occurred. To obtain it, we first used Open Refine to fetch JSON information from OpenStreetMap with the method "Add column by fetching URLs". The command we used is:

With this, we obtained some json information for each row, for example:

In [33]:
[{"place_id":323075484,
"licence":"Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type":"relation",
"osm_id":1180533,
"lat":"38.6280278",
"lon":"-90.1910154",
"class":"boundary",
"type":"administrative",
"place_rank":12,
"importance":0.7078550940195126,
"addresstype":"independent_city",
"name":"Saint Louis",
"display_name":"Saint Louis, Missouri, United States",
"boundingbox":["38.5323215","38.7743018","-90.3206525","-90.1641941"]}]

[{'place_id': 323075484,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright',
  'osm_type': 'relation',
  'osm_id': 1180533,
  'lat': '38.6280278',
  'lon': '-90.1910154',
  'class': 'boundary',
  'type': 'administrative',
  'place_rank': 12,
  'importance': 0.7078550940195126,
  'addresstype': 'independent_city',
  'name': 'Saint Louis',
  'display_name': 'Saint Louis, Missouri, United States',
  'boundingbox': ['38.5323215', '38.7743018', '-90.3206525', '-90.1641941']}]

From this information we wanted to obtain the county of each incident. However, we could not do so with Open Refine, as the URL we used did not return county names in the same format as the census.gov dataset, so we had to process this information again with Python.
We defined some functions that try to find, for each incident, a county name that exists in the population dataset obtained from census.gov. The functions iterate through the names in the `"display_name"` key of the json column (in some cases the township or other information also appears there, so the county is not always in the same position), apply some transformations to each name and check whether the result is a county.

In [34]:
names_counties = defaultdict(set)  # names of the counties for each state

for state in df_population['STNAME'].values:
    names_counties[state] = set(df_population[df_population['STNAME']==state]['CTYNAME'].values)


def transforms(name: str) -> str:
    if 'Clarke' in name: return 'Clarke County'
    if 'Connecticut' in name: return 'connecticut_county_unknown'
    name = name.replace('Saint', 'St.').replace('Nashville-', '').replace('Vista', 'Buena Vista').replace('Compton', 'Los Angeles County')
    name = name.replace('Monterey Park', 'Monterey County')
    return name.strip()


def try_find(name, state) -> str | None:
    '''tries to find the county corresponding to `name` by 
    applying transformations and trying different formats'''
    if name in names_counties[state]: return name

    if transforms(name) in names_counties[state]: return transforms(name)

    #try county
    n = transforms(name)
    if 'County' not in n: n += ' County'
    if n in names_counties[state]: return n

    #try parish
    n = transforms(name)
    if 'Parish' not in n: n += ' Parish'
    if n in names_counties[state]: return n

    #try city
    n = transforms(name)
    if 'city' not in n: n += ' city'
    if n in names_counties[state]: return n

    #try Municipality
    n = transforms(name)
    if 'Municipality' not in n: n += ' Municipality'
    if n in names_counties[state]: return n

    return None


def extract_county(row) -> str | None:
    '''tries to extract the corresponding county for the given row'''
    json_data = row['json']
    state = row['State'].strip()

    places = json.loads(json_data)
    if places:
        names = places[0].get("display_name", "")
        # Look for "County", "Parish", or other equivalent entities in display_name
        
        delims = [",", ")", "("]

        for delim in delims:
            names = "_".join(names.split(delim))
        names = names.split("_")
        names = [n.strip() for n in names if n != '']

        for n in names:
            cname = try_find(n, state)
            if cname: return cname
            else: continue
    
        return None
    

df_school_incidents = pd.read_csv('datasets/Schoolincidents-json.csv')
df_shootings = pd.read_csv("datasets/MassShootingsComplete-json.csv")

df_shootings['County'] = df_shootings.apply(extract_county, axis=1)
df_school_incidents['County'] = df_school_incidents.apply(extract_county, axis=1)

#remove rows with null or unknown county
df_shootings = df_shootings.dropna(subset=['County'])
df_school_incidents = df_school_incidents.dropna(subset=['County'])

With these transformations we obtain a county for **all rows** that had a non-null value in the json column (for 70 rows out of about 5k, the URL fetch in Open Refine returned just `"[]"`).
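
To make the matching concrete, here is a worked sketch on the Saint Louis sample shown earlier. It is a simplified stand-in for the functions above, not a replacement: `toy_counties`, `candidates` and `match` are hypothetical names, and the lookup holds only two Missouri entries.

```python
import json
import re

# Toy lookup for one state; the notebook builds the full one from the census data
toy_counties = {"Missouri": {"St. Louis city", "St. Louis County"}}

sample = '[{"display_name": "Saint Louis, Missouri, United States"}]'


def candidates(display_name: str) -> list[str]:
    """Split a display_name on commas and parentheses and drop empty parts."""
    return [p.strip() for p in re.split(r"[,()]", display_name) if p.strip()]


def match(part: str, state: str):
    """Try the bare name, then each suffix in order, against the toy lookup."""
    base = part.replace("Saint", "St.")
    for suffix in ("", " County", " Parish", " city", " Municipality"):
        if base + suffix in toy_counties[state]:
            return base + suffix
    return None


parts = candidates(json.loads(sample)[0]["display_name"])
print(parts)
print(match(parts[0], "Missouri"))
```

Because the `County` suffix is tried before `city`, a name that exists in both forms resolves to the county form. This only matters for places such as independent cities that share a name with a neighbouring county.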

### Obtaining FIPS codes

Having the FIPS code of each county made it easier to work with Altair's choropleth charts. At this point it was easy to obtain the FIPS code for each incident, as we already had them in the county population CSV. A simple function was enough:

In [35]:
def get_fips(row) -> str | None:
    """
    Returns the FIPS code for the given row by searching
    it in the population csv
    """

    county_value = row['County']

    # connecticut separate case:
    connecticut_fips = ['09110', '09120', '09130', '09140', '09150', '09160', '09170', '09180', '09190']
    if county_value == 'connecticut_county_unknown':
        fips = choice(connecticut_fips)  # we map the incident into a random county
        return f'{int(fips):02d}'

    if county_value is not None:
        county_value = county_value.strip()
    else:
        county_value = ""  # if it is None, use an empty string to avoid errors
    
    fips_value = df_population[(df_population['STNAME'] == row['State']) & (df_population['CTYNAME'] == county_value)]['FIPS']
    
    # Return the first value (it is unique)
    return f'{int(fips_value.iloc[0]):02d}' if not fips_value.empty else None


df_shootings['FIPS'] = df_shootings.apply(get_fips, axis=1)
df_shootings.dropna(inplace=True)
df_shootings['FIPS'] = df_shootings['FIPS'].apply(lambda x: f"{int(x):05d}") # to ensure length 5


df_school_incidents['FIPS'] = df_school_incidents.apply(get_fips, axis=1)
df_school_incidents.dropna(inplace=True)
df_school_incidents['FIPS'] = df_school_incidents['FIPS'].apply(lambda x: f"{int(x):05d}")
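
The row-wise lookup above can also be expressed as a single left merge on state and county name. A minimal sketch on toy frames (the incident rows are invented, including the deliberately unmatched `Nowhere County`):

```python
import pandas as pd

population = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama"],
    "CTYNAME": ["Autauga County", "Baldwin County"],
    "FIPS": ["01001", "01003"],
})
incidents = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "County": ["Baldwin County", "Autauga County", "Nowhere County"],
})

# Align the key names, then attach the FIPS code to every incident row
lookup = population.rename(columns={"STNAME": "State", "CTYNAME": "County"})
with_fips = incidents.merge(lookup, on=["State", "County"], how="left")

print(with_fips)
```

Rows without a match keep a missing FIPS value, so they can be inspected before being dropped.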

In [36]:
#### save partial datasets ###
# df_shootings.to_csv('datasets/MassShootingsComplete_FIPS.csv', index=False)
# df_school_incidents.to_csv('datasets/SchoolIncidents_FIPS.csv', index=False)

Now we just need some final aggregations and transformations to put all this information into a single CSV file, which simplifies the code for the Altair charts.

### Final transformations

In [37]:
def split_date(df):
    df['Incident Date'] = pd.to_datetime(df['Incident Date'])
    df['Year'] = df['Incident Date'].dt.year
    df['Month'] = df['Incident Date'].dt.month
    df.drop(labels='Incident Date', axis=1, inplace=True)
    return df

#### County population

In [38]:
df_population.drop(['STATE', 'COUNTY'], inplace=True, axis=1)

df_population.rename(inplace=True, columns={
    'STNAME': 'State',
    'CTYNAME': 'County',
    'POPESTIMATE2014': '2014',
    'POPESTIMATE2015': '2015',
    'POPESTIMATE2016': '2016',
    'POPESTIMATE2017': '2017',
    'POPESTIMATE2018': '2018',
    'POPESTIMATE2019': '2019',
    'POPESTIMATE2020': '2020',
    'POPESTIMATE2021': '2021',
    'POPESTIMATE2022': '2022',
    'POPESTIMATE2023': '2023'    
})


df_population = pd.melt(df_population, id_vars=['FIPS','State','County'], value_vars=[str(i) for i in range(2014,2024)])
df_population.rename(inplace=True, columns={
    'variable': 'Year',
    'value': 'Population'
})

In [39]:
df_population.columns

Index(['FIPS', 'State', 'County', 'Year', 'Population'], dtype='object')
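
The `pd.melt` call turns one column per year into one row per county and year. A toy illustration of that wide-to-long reshape (the population figures are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    "FIPS": ["01001"],
    "State": ["Alabama"],
    "County": ["Autauga County"],
    "2014": [100],   # toy population values
    "2015": [110],
})

long = wide.melt(
    id_vars=["FIPS", "State", "County"],
    value_vars=["2014", "2015"],
    var_name="Year",          # name for the former column headers
    value_name="Population",  # name for the former cell values
)

print(long)
```

Passing `var_name` and `value_name` directly makes the later `rename` step unnecessary.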

#### Mass shootings

In [40]:
df_shootings = split_date(df_shootings)
df_shootings = df_shootings[df_shootings['Year']<2024]
df_shootings['FIPS'] = df_shootings['FIPS'].astype(int)
df_shootings = df_shootings.groupby(['FIPS', 'County', 'State', 'Year', 'Month']).size().reset_index(name='Shootings')

#### School incidents

In [41]:
df_school_incidents.drop(['Address', 'Business/Location Name'], axis=1, inplace=True)
df_school_incidents = split_date(df_school_incidents)
df_school_incidents = df_school_incidents.groupby(['FIPS', 'County', 'State', 'Year', 'Month']).size().reset_index(name='School Incidents')

### Final merge

In [None]:
df_population['Year'] = df_population['Year'].apply(str)
df_population['FIPS'] = df_population['FIPS'].astype(int)

df_shootings['Year'] = df_shootings['Year'].astype(str)

df_school_incidents['Year'] = df_school_incidents['Year'].apply(str)
df_school_incidents['FIPS'] = df_school_incidents['FIPS'].astype(int)



### Merge mass shootings and school incidents
df_complete = pd.merge(df_shootings, df_school_incidents, on=['FIPS', 'County', 'State', 'Year', 'Month'], how='outer')
df_complete = pd.DataFrame(df_complete.fillna(0))

print(df_complete['Month'].value_counts())

# join population
# Add month to population
df_population['Month'] = 1
df_population = pd.concat([df_population] * 12, ignore_index=True)
df_population['Month'] = df_population.groupby(['FIPS', 'Year']).cumcount() + 1

# merge the monthly population rows with the incident counts
df_complete = pd.merge(df_population,df_complete, on=['FIPS', 'County', 'State', 'Year'], how='left')
df_complete = df_complete.fillna(0)
df_complete['Month'] = df_complete['Month_x']
# df_complete.drop(['Month_x', 'Month_y'], axis=1, inplace=True)
df_complete['FIPS'] = df_complete['FIPS'].apply(lambda x: f'{x:05d}')  # zero-pad to 5 digits

# compute ratio
df_complete['Ratio County'] = df_complete['Shootings'] / df_complete['Population'] * 1000000

Month
9     545
5     522
4     480
8     474
6     458
7     448
10    448
2     424
1     404
3     398
11    376
12    329
Name: count, dtype: int64


In [None]:
print(df_complete['Month_x'].value_counts())

Month
1      34150
49     34150
93     34150
91     34150
89     34150
       ...  
228       10
176       10
226       10
224       10
288       10
Name: count, Length: 288, dtype: int64
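
The final merge joins the monthly population rows to the incident rows on FIPS, County, State and Year, but not on Month. With those keys, every population month is paired with every incident month of the same county and year. A minimal sketch on toy data (all values are illustrative) contrasting the two choices of join keys:

```python
import pandas as pd

# One county-year expanded to 12 monthly population rows
population = pd.DataFrame({
    "FIPS": [1001] * 12,
    "Year": ["2014"] * 12,
    "Month": list(range(1, 13)),
    "Population": [50000] * 12,
})
# Two months with incidents in that county-year
incidents = pd.DataFrame({
    "FIPS": [1001, 1001],
    "Year": ["2014", "2014"],
    "Month": [3, 7],
    "Shootings": [1, 2],
})

without_month = population.merge(incidents, on=["FIPS", "Year"], how="left")
with_month = population.merge(incidents, on=["FIPS", "Year", "Month"], how="left")

# Row count and total shootings under each choice
print(len(without_month), without_month["Shootings"].sum())
print(len(with_month), with_month["Shootings"].fillna(0).sum())
```

With Month in the keys, months without incidents come out as missing values and can be filled with 0, as the notebook already does.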


In [55]:
### save to csv file
df_complete.to_csv('datasets/GunViolenceCompleteData.csv')

The preprocessing is done! We are ready to start making visualizations! 🚀🚀

## Questions

In the following sections we show the steps we followed in the design of the visualizations for each question. At the end of each section we provide the final visualization together with a brief explanation of the decisions taken and the problems we encountered.

In [105]:
df_GunViolence_csv = pd.read_csv('datasets/GunViolenceCompleteData.csv')

<h3 style="color:darkblue"> Q1: What are the states with large number of mass shootings per citizen?</h3>

In [7]:

#print(df_GunViolence.loc[df_GunViolence['State'] == 'District of Columbia'])
df_GunViolence = df_GunViolence_csv

# group by year
df_GunViolence = (
    df_GunViolence.groupby(['FIPS', 'County', 'State', 'Year'])
    .agg({
        'Population': 'first',  # take the first available Population value
        'Shootings': 'sum'      # sum the Shootings values
    })
    .reset_index()
)
#print(df_GunViolence.loc[df_GunViolence['State'] == 'District of Columbia'].head())
#print(df_GunViolence.loc[df_GunViolence['State'] == 'Louisiana'].head())


# group by state
df_GunViolence = df_GunViolence.groupby(['State', 'Year'])[['Shootings', 'Population']].sum().reset_index()
df_GunViolence['Ratio State'] = df_GunViolence['Shootings'] / df_GunViolence['Population'] * 1000000 # shootings per million inhabitants

print('post agrupar per state')
print(df_GunViolence.loc[df_GunViolence['State'] == 'District of Columbia'].head())
print(df_GunViolence.loc[df_GunViolence['State'] == 'South Dakota'].head())

# mean ratio for years
df_GunViolence = df_GunViolence.groupby('State')['Ratio State'].mean().reset_index()

print(df_GunViolence.head())

elections_republican = {'Alabama': 4, 'Alaska': 4, 'Arizona': 3, 'Arkansas': 4, 'California': 0, 'Colorado': 0, 'Connecticut': 0, 'Delaware': 0, 'District of Columbia': 0, 'Florida': 3, 'Georgia': 3, 'Hawaii': 0, 'Idaho': 4, 'Illinois': 0, 'Indiana': 4, 'Iowa': 3, 'Kansas': 4, 'Kentucky': 4, 'Louisiana': 4, 'Maine': 0, 'Maryland': 0, 'Massachusetts': 0, 'Michigan': 2, 'Minnesota': 0, 'Mississippi': 4, 'Missouri': 4, 'Montana': 4, 'Nebraska': 4, 'Nevada': 1, 'New Hampshire': 0, 'New Jersey': 0, 'New Mexico': 0, 'New York': 0, 'North Carolina': 4, 'North Dakota': 4, 'Ohio': 3, 'Oklahoma': 4, 'Oregon': 0, 'Pennsylvania': 2, 'Rhode Island': 0, 'South Carolina': 4, 'South Dakota': 4, 'Tennessee': 4, 'Texas': 4, 'Utah': 4, 'Vermont': 0, 'Virginia': 0, 'Washington': 0, 'West Virginia': 4, 'Wisconsin': 2, 'Wyoming': 4}
mapping = {
    4: 'Republicans won the last 4 elections',
    3: 'Republicans won 3 of the last 4 elections',
    2: 'Republicans won 2 and Democrats won 2 of the last 4 elections',
    1: 'Democrats won 3 of the last 4 elections',
    0: 'Democrats won the last 4 elections'
}

for state in elections_republican.keys():
    elections_republican[state] = mapping[elections_republican[state]]


df_GunViolence['Republican Vote'] = df_GunViolence['State'].map(elections_republican)

df_GunViolence[df_GunViolence['State'] == 'District of Columbia'].head()


post agrupar per state
                   State  Year  Shootings  Population  Ratio State
80  District of Columbia  2014       72.0   1327206.0    54.249303
81  District of Columbia  2015       36.0   1354028.0    26.587338
82  District of Columbia  2016       60.0   1375152.0    43.631540
83  District of Columbia  2017       60.0   1394158.0    43.036729
84  District of Columbia  2018       72.0   1408294.0    51.125688
            State  Year  Shootings  Population  Ratio State
410  South Dakota  2014       12.0   1699340.0     7.061565
411  South Dakota  2015       12.0   1709326.0     7.020311
412  South Dakota  2016        0.0   1727386.0     0.000000
413  South Dakota  2017        0.0   1747464.0     0.000000
414  South Dakota  2018        0.0   1758772.0     0.000000
        State  Ratio State
0     Alabama    13.745900
1      Alaska     4.086223
2     Arizona     4.498215
3    Arkansas    10.718316
4  California     6.390838


                  State  Ratio State                     Republican Vote
8  District of Columbia    67.066225  Democrats won the last 4 elections
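
The head of the saved CSV, printed in Q2, starts with a state-level row (FIPS 1000, County `Alabama`) next to the county rows. Assuming every state has such a summary row, summing `Population` per state counts each resident twice and lowers the ratio. A sketch on toy numbers of filtering those rows out before aggregating (the `FIPS % 1000 != 0` test is an assumption about how summary rows are coded):

```python
import pandas as pd

rows = pd.DataFrame({
    "FIPS": [1000, 1001, 1003],
    "State": ["Alabama"] * 3,
    "County": ["Alabama", "Autauga County", "Baldwin County"],
    "Population": [300, 100, 200],  # toy numbers: the state row equals the county total
    "Shootings": [0, 1, 2],
})

counties_only = rows[rows["FIPS"] % 1000 != 0]  # drop state-level summary rows

naive = rows.groupby("State")[["Shootings", "Population"]].sum()
clean = counties_only.groupby("State")[["Shootings", "Population"]].sum()

naive_ratio = naive["Shootings"] / naive["Population"] * 1_000_000
clean_ratio = clean["Shootings"] / clean["Population"] * 1_000_000

print(naive_ratio["Alabama"], clean_ratio["Alabama"])
```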


To display the ratios per state, the most straightforward option is a bar plot. We opted for a horizontal one so that the labels can be read properly.

Since we need to quickly find the states with the most mass shootings per citizen, we sort the bars by value.


Additionally, we wanted to encode which party won each state in the last 4 elections, given the relevance of red, blue and swing states in US elections and society.

In [8]:
df_GunViolence['Republican Vote'] = df_GunViolence['State'].map(elections_republican)


chart_state = alt.Chart(df_GunViolence).mark_bar().encode(
    alt.X('Ratio State:Q'),
    alt.Y('State:N', sort='-x'),
    alt.Color('Republican Vote:O', 
              scale=alt.Scale(domain=['Democrats won the last 4 elections', 'Democrats won 3 of the last 4 elections', 'Republicans won 2 and Democrats won 2 of the last 4 elections', 'Republicans won 3 of the last 4 elections', 'Republicans won the last 4 elections'], range=['#0000FF', '#ADD8E6', '#800080', '#FF6666', '#FF0000']),
              title='Majority Vote in the last 4 elections'
    )
).properties(title = 'Mass shootings per million inhabitants by state')

chart_state

Seems good. However, this plot takes up a lot of space just to display small quantities. For that reason, and given that Q1 only asks for the states with the largest number of mass shootings per citizen, we decided to truncate the plot and show only the top k states.

In addition, the color legend takes up a lot of space and its labels cannot be fully displayed.

In [18]:
sum(df_GunViolence_csv[df_GunViolence_csv['FIPS']==11001]['Shootings'].fillna(0))

0.0

In [9]:
elections_republican = {'Alabama': 4, 'Alaska': 4, 'Arizona': 3, 'Arkansas': 4, 'California': 0, 'Colorado': 0, 'Connecticut': 0, 'Delaware': 0, 'District of Columbia': 0, 'Florida': 3, 'Georgia': 3, 'Hawaii': 0, 'Idaho': 4, 'Illinois': 0, 'Indiana': 4, 'Iowa': 3, 'Kansas': 4, 'Kentucky': 4, 'Louisiana': 4, 'Maine': 0, 'Maryland': 0, 'Massachusetts': 0, 'Michigan': 2, 'Minnesota': 0, 'Mississippi': 4, 'Missouri': 4, 'Montana': 4, 'Nebraska': 4, 'Nevada': 1, 'New Hampshire': 0, 'New Jersey': 0, 'New Mexico': 0, 'New York': 0, 'North Carolina': 4, 'North Dakota': 4, 'Ohio': 3, 'Oklahoma': 4, 'Oregon': 0, 'Pennsylvania': 2, 'Rhode Island': 0, 'South Carolina': 4, 'South Dakota': 4, 'Tennessee': 4, 'Texas': 4, 'Utah': 4, 'Vermont': 0, 'Virginia': 0, 'Washington': 0, 'West Virginia': 4, 'Wisconsin': 2, 'Wyoming': 4}

mapping = {
    4: 'Republicans won the last 4',
    3: 'Republicans won 3',
    2: 'Republicans won 2, Democrats won 2',
    1: 'Democrats won 3',
    0: 'Democrats won the last 4'

}
for state in elections_republican.keys():
    elections_republican[state] = mapping[elections_republican[state]]


df_GunViolence['Republican Vote'] = df_GunViolence['State'].map(elections_republican)


k = 10

top_k_states = df_GunViolence.nlargest(k, 'Ratio State')

chart_state = alt.Chart(top_k_states).mark_bar().encode(
    alt.X('Ratio State:Q'),
    alt.Y('State:N', sort='-x'),
    alt.Color('Republican Vote:O', 
              scale=alt.Scale(domain=['Democrats won the last 4', 'Democrats won 3', 'Republicans won 2, Democrats won 2', 'Republicans won 3', 'Republicans won the last 4'], range=['#0000FF', '#ADD8E6', '#800080', '#FF6666', '#FF0000']),
              title='Majority Vote (last 4 elections)'
    )
).properties(title = f'Top {k} States by Mass shootings per inhabitant')

chart_state

<h3 style="color:darkblue"> Q2: How is the number of mass shootings per citizen distributed across the different counties in the US? And across states?</h3>

In [10]:
from vega_datasets import data

df_GunViolence = pd.read_csv('datasets/GunViolenceCompleteData.csv')

# Group by county (FIPS) and add up the shootings
print(df_GunViolence.head())
# Assumption: as in Q4, the sum is divided by 12 because the final merge joins
# without Month and repeats each incident row for the 12 population months.
df_shootings_grouped = (
    df_GunViolence.groupby('FIPS')['Shootings'].sum().div(12)
    .reset_index(name='Total Shootings')
)

# Load the geometry of the US counties (through the topojson URL)
counties_geo = alt.topo_feature(data.us_10m.url, feature='counties')

# Create the choropleth map with Altair
map_chart = alt.Chart(counties_geo).mark_geoshape().encode(
    color=alt.Color('Total Shootings:Q', scale=alt.Scale(scheme='reds'), title='Total Shootings'),
    tooltip=['properties.name:N', 'Total Shootings:Q']  # show the county name and the number of shootings
).transform_lookup(
    lookup='id',  # county FIPS code
    from_=alt.LookupData(df_shootings_grouped, 'FIPS', ['Total Shootings'])  # join on FIPS
).project(
    type='albersUsa'
).properties(
    width=800, height=500,
    title="Total Shootings per County in the USA"
)

map_chart.show()

   Unnamed: 0  FIPS    State          County  Year  Population  Month_x  \
0           0  1000  Alabama         Alabama  2014   4843737.0        1   
1           1  1001  Alabama  Autauga County  2014     54922.0        1   
2           2  1003  Alabama  Baldwin County  2014    199306.0        1   
3           3  1005  Alabama  Barbour County  2014     26768.0        1   
4           4  1007  Alabama     Bibb County  2014     22541.0        1   

   Month_y  Shootings  School Incidents  Ratio County  
0      NaN        NaN               NaN           NaN  
1      NaN        NaN               NaN           NaN  
2      NaN        NaN               NaN           NaN  
3      NaN        NaN               NaN           NaN  
4      NaN        NaN               NaN           NaN  

#### Visualization description
We decided to...

<h3 style="color:darkblue"> Q3: Are the mass shootings correlated with gun violence incidents in schools?</h3>

In [101]:
df_GunViolence = df_GunViolence_csv

In [95]:
# Group by state and add up the total number of shootings per state
df_GunViolence = df_GunViolence_csv.fillna(0)


df_GunViolence = df_GunViolence[df_GunViolence['Year']==2023]
df_GunViolence = df_GunViolence.groupby(['State'])[['Shootings', 'School Incidents', 'Population']].agg({
    'Shootings': 'sum',
    'School Incidents': 'sum',
    'Population': 'first'
}).reset_index()

df_GunViolence['Ratio Shootings'] = df_GunViolence['Shootings']*1000000 / df_GunViolence['Population']
df_GunViolence['Ratio School Incidents'] = df_GunViolence['School Incidents'] *1000000/ df_GunViolence['Population']


scatter_plot = alt.Chart(df_GunViolence).mark_circle(size=100).encode(
    y=alt.Y('Ratio Shootings:Q', title='Ratio of Shootings per Population'),
    x=alt.X('Ratio School Incidents:Q', title=''),
    tooltip=['State', 'Shootings', 'School Incidents', 'Population'] 
).properties(
    width=800,
    height=600,
    title="Ratio of Shootings per Population by State"
)

scatter_plot.show()

In [61]:
# Group by state and add up the total number of shootings per state
df_GunViolence = df_GunViolence_csv.fillna(0)


df_GunViolence = df_GunViolence[df_GunViolence['Year']==2023]
df_GunViolence = df_GunViolence.groupby(['State'])[['Shootings', 'School Incidents', 'Population']].agg({
    'Shootings': 'sum',
    'School Incidents': 'sum',
    'Population': 'first'
}).reset_index()

df_GunViolence['Ratio Shootings'] = df_GunViolence['Shootings']*1000000 / df_GunViolence['Population']
df_GunViolence['Ratio School Incidents'] = df_GunViolence['School Incidents'] *1000000/ df_GunViolence['Population']


scatter_plot = alt.Chart(df_GunViolence).mark_circle(size=100).encode(
    y=alt.Y('Ratio Shootings:Q', title='Ratio of Shootings per Population'),
    x=alt.X('Ratio School Incidents:Q', title='Ratio of School Incidents per Population'),
    tooltip=['State', 'Shootings', 'School Incidents', 'Population'] 
).properties(
    width=800,
    height=600,
    title="Ratio of Shootings per Population by State"
)

df_GunViolence['Label'] = df_GunViolence.apply(
    lambda x: x['State'] if (x['Ratio Shootings'] >= 80 or x['Ratio School Incidents']>=100)  else '', axis=1
)
# Add the labels
# Adjust the labels individually
# Label for "South Carolina", above and to the left of its point
text_labels_sc = alt.Chart(df_GunViolence[df_GunViolence['State'] == 'South Carolina']).mark_text(
    align='center',  # horizontal alignment
    dx=-20,  # shift to the left
    dy=-15,  # shift upwards
    size=14  # larger text size
).encode(
    y=alt.Y('Ratio Shootings:Q'),
    x=alt.X('Ratio School Incidents:Q'),
    text='State'
)

# Labels for the remaining states, below and to the left of their points
text_labels_others = alt.Chart(df_GunViolence[(df_GunViolence['Label'] != '') & (df_GunViolence['State'] != 'South Carolina')]).mark_text(
    align='right',  # right-aligned
    dx=-7,  # shift to the left
    dy=14,  # shift downwards
    size=14  # larger text size
).encode(
    y=alt.Y('Ratio Shootings:Q'),
    x=alt.X('Ratio School Incidents:Q'),
    text='State'
)

# Combine the scatter plot with the labels
final_plot = scatter_plot + text_labels_sc + text_labels_others


final_plot.show()



<h4 style="color:goldenrod"> Visualization description </h4>
We decided to...

<h3 style="color:darkblue"> Q4: How have mass shootings evolved the last years in the US?</h3>

We had two main ideas: showing how the number of shootings has changed over the years (and perhaps relating it to the government in office), and showing possible underlying trends within each year.

Let's start with the first representation:

<h4 style="color:goldenrod"> Shootings per year </h4>

In [None]:
df_GunViolence = df_GunViolence_csv.fillna(0)

df_GunViolence = df_GunViolence.groupby('Year')["Shootings"].sum().reset_index()
df_GunViolence = df_GunViolence[df_GunViolence['Year']<2024]
df_GunViolence['Year'] = pd.to_datetime(df_GunViolence['Year'], format='%Y')
df_GunViolence['Shootings'] = df_GunViolence['Shootings'].apply(lambda x : x/12)

chart_ms = alt.Chart(df_GunViolence).mark_line(
    color='black',
    point=alt.OverlayMarkDef(filled=True, fill="black"),
    strokeWidth=2.5
).encode(
    x=alt.X('Year:T', timeUnit='year', title='Year'),
    y=alt.Y('Shootings:Q', title='Total Shootings').scale(domain=(200,700))
).properties(
    title={
        "text": ['Total Shootings per Year in the USA'],
        "fontSize": 16,
    }
)




chart_ms

Not bad! Let's try to add the party in government (Democratic or Republican) for each year.

In [10]:
### FIRST ATTEMPT ###
gov = pd.DataFrame({
    'Year': [i for i in range(2014, 2025)],
    'y1': [230+500]*11,
    'y2': [220+500]*11,
    'governement': ['Democratic']*3 + ['Republican']*5 + ['Democratic']*3
})

gov['Year'] = pd.to_datetime(gov['Year'], format='%Y')

chart_gov = alt.Chart(gov).mark_area().encode(
    x=alt.X('Year:T', timeUnit = 'year', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('y1:Q', title='Total Shootings', scale=alt.Scale(domain=(200, 720))),
    y2='y2:Q',
    color=alt.Color('governement', legend=alt.Legend(
        orient='none', 
        legendX=185, legendY=-30,  # Adjust the legend position
        direction='horizontal',
        titleAnchor='middle'  # Center the title within the legend
    ), scale=alt.Scale(domain=['Democratic', 'Republican'], range=['#3182bd', '#d7301f']))
)

df_GunViolence['Year'] = pd.to_datetime(df_GunViolence['Year'], format='%Y')

chart_ms = alt.Chart(df_GunViolence).mark_line(
    color='black',
    point=alt.OverlayMarkDef(filled=True, fill="black"),
    strokeWidth=2.5).encode(
    x=alt.X('Year:T', timeUnit='year', title='Year'),
    y=alt.Y('Shootings:Q', title='Total Shootings').scale(domain=(200,750))
).properties(
    title={
        "text":['Total Shootings per Year in the USA'],
        "dy": -10,
        "fontSize": 16,
    },
    width=460,
    height=300
)

legend_data = pd.DataFrame({
    'Year': ['2015-09-18 00:00:00+00:00', '2018-09-18 00:00:00+00:00'],
    'Total Shootings': [700, 660],
    'label': ['Democratic', 'Republican'],
    'governement': ['Democratic', 'Republican']
})



linechart_years = chart_gov + chart_ms
linechart_years

We were not really convinced by the looks and space-efficiency of this approach, so we tried another option:

In [73]:
import altair as alt
import pandas as pd

In [152]:
RED = '#f5b7b1'
BLUE = '#aed6f1'
D = alt.Scale(domain=(3000/12,8500/12))
w = 450

chart_ms = alt.Chart(df_GunViolence).mark_line(
    color='black',
    point=alt.OverlayMarkDef(filled=True, fill="black"),
    strokeWidth=2.5
).encode(
    x=alt.X('Year:T', timeUnit='year', title='Year'),
    y=alt.Y('Shootings:Q', title='Total Shootings', scale=D)
).properties(
    title={
        "text": ['Total Shootings per Year in the USA'],
        "fontSize": 16,
    },
    width=w,
    height=400
)

# We keep this chart only for its legend
gov = pd.DataFrame({
    'governement':['Republican', 'Democratic']
})

chart_gov = alt.Chart(gov).mark_area().encode(
    x=alt.X('Year:T', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('y1:Q', title='Total Shootings', scale=D,axis=None),
    y2='y2:Q',
    color=alt.Color('governement', legend=alt.Legend(
        orient='top-left', 
        legendX=130, legendY=-100,  # Adjust the legend position
        strokeColor='gray',
        fillColor='#EEEEEE',
        padding=10,
        cornerRadius=10,
        titleAnchor='middle'  # Center the title within the legend
    ), scale=alt.Scale(domain=['Democratic', 'Republican'], range=[BLUE, RED]))
)


df_dem1 = pd.DataFrame({
    'x1': 2014,
    'x2': 2017,
    'y1': [250]*2, 
    'y2': [700]*2,
    'governement': ['Democratic']*2
})
df_dem1['x1'] = pd.to_datetime(df_dem1['x1'], format ='%Y')
df_dem1['x2'] = pd.to_datetime(df_dem1['x2'], format ='%Y')


df_rep = pd.DataFrame({
    'x1': 2017,
    'x2': 2021,
    'y1': [250]*2, 
    'y2': [700]*2,
    'governement': ['Republican']*2
})
df_rep['x1'] = pd.to_datetime(df_rep['x1'], format ='%Y')
df_rep['x2'] = pd.to_datetime(df_rep['x2'], format ='%Y')

df_dem2 = pd.DataFrame({
    'x1': 2021,
    'x2': 2023,
    'y1': [250]*2, 
    'y2': [700]*2,
    'governement': ['Democratic']*2
})
df_dem2['x1'] = pd.to_datetime(df_dem2['x1'], format ='%Y')
df_dem2['x2'] = pd.to_datetime(df_dem2['x2'], format ='%Y')

# Rectangles for government background
dem1 = alt.Chart(df_dem1).mark_rect(color=BLUE, opacity=1).encode(
    x=alt.X('x1:T', title=None),  # No title for X axis here
    x2='x2:T',
    y=alt.Y('y1:Q', scale=D, axis=None),  # No Y axis for background
    y2=alt.Y2('y2:Q'),
    tooltip=alt.value(None)
)

rep = alt.Chart(df_rep).mark_rect(color=RED, opacity=1).encode(
    x=alt.X('x1:T', title=None),
    x2='x2:T',
    y=alt.Y('y1:Q', scale=D, axis=None),
    y2=alt.Y2('y2:Q'),
    tooltip=alt.value(None)
)

dem2 = alt.Chart(df_dem2).mark_rect(color=BLUE, opacity=1).encode(
    x=alt.X('x1:T', title=None),
    x2='x2:T',
    y=alt.Y('y1:Q', scale=D, axis=None),
    y2=alt.Y2('y2:Q'),
    tooltip=alt.value(None)
)

# Manual gridlines
y_values = pd.DataFrame({'y': [y for y in range(250, 701, 50)]})
lines = alt.Chart(y_values).mark_rule(color='gray', size=0.8, opacity=0.5).encode(
    y=alt.Y('y:Q', scale=D, title=None, axis=None),
    tooltip=alt.value(None)
)




#################################
# adding some extra information #
#################################



line2 = pd.DataFrame({
    'x': [2019,2019],  # Fixed x coordinate for the vertical line
    'y': [294, 390],  # Y range spanned by the line
})

line2['x'] = pd.to_datetime(line2['x'],  format='%Y')

# Build the annotation line with Altair
chart_line2 = alt.Chart(line2).mark_line(strokeWidth=1, color='black').encode(
    x=alt.X('x:T', timeUnit='year'),  # Fixed x coordinate
    y=alt.Y('y:Q', axis=None,scale=D),
    tooltip=alt.value(None)
)

text2 = pd.DataFrame({
    'x': [2019],
    'y': [280],
    'text': ["GVA reaches 7000+ sources"]
})
text2['x'] = pd.to_datetime(text2['x'], format = "%Y")

chart_text2 = alt.Chart(text2).mark_text().encode(
    text="text",
    x = alt.X('x:T', timeUnit='year'),
    y = alt.Y("y:Q", axis=None,scale=D)
)



# Combine layers with shared y-scale
chart_with_background = alt.layer(
    dem1,       # Background rectangles
    dem2,
    rep,
    lines,     # Gridlines
    chart_ms,   # Main line chart
    chart_gov,
    chart_text2,
    chart_line2
).properties(
    width=w,
    height=400
).resolve_scale(
    y='independent'  # Independent y-scales; layers still align because each one uses the same explicit domain D
)
chart_with_background

<h4 style="color:goldenrod"> Shooting trends inside a year </h4>

In [157]:
MassShootings = pd.read_csv("datasets/MassShootingsComplete_FIPS.csv")

In [159]:
# Parse the date column first: read_csv loads it as plain strings,
# so the .dt accessor would otherwise raise an AttributeError.
MassShootings['Incident Date'] = pd.to_datetime(MassShootings['Incident Date'])
MassShootings['Year'] = MassShootings['Incident Date'].dt.year
MassShootings['Month'] = MassShootings['Incident Date'].dt.month
monthly_shootings = MassShootings[MassShootings['Year']>2014].groupby(['Year', 'Month']).size().reset_index(name='Total Shootings')
monthly_avg = monthly_shootings.groupby('Month')['Total Shootings'].mean().reset_index()

In [430]:
line_chart_months1 = alt.Chart(monthly_shootings[monthly_shootings['Year']>2013]).mark_line().encode(
    x=alt.X('Month:O', title='Month', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('Total Shootings:Q', title='Total number of shootings'),
    color=alt.Color('Year:O', title='Year')
).properties(
    title='Monthly evolution of mass shootings in the USA by year'
)

line_chart_months1

Even though the overall trend is visible, there are too many years to tell the lines apart, so we can try to show only the last 5 years. A lighter color corresponds to an earlier year, but this may not be the best way to show the increase in shootings over the years.

In [431]:
line_chart_months2 = alt.Chart(monthly_shootings[monthly_shootings['Year']>2019]).mark_line().encode(
    x=alt.X('Month:O', title='Month', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('Total Shootings:Q', title='Total number of shootings'),
    color=alt.Color('Year:N', title='Year')
).properties(
    title='Monthly evolution of mass shootings in the USA by year'
)

line_chart_months2

That's better!

In [40]:
monthly_shootings['year-month'] = (monthly_shootings['Year'])*100 + (monthly_shootings['Month']-1)*10/11

line_chart_mont_year = alt.Chart(monthly_shootings[monthly_shootings['Year']>2013]).mark_line().encode(
    x=alt.X('year-month:O', title='Month', axis=None),
    y=alt.Y('Total Shootings:Q', title='Total number of shootings')
).properties(
    title='Monthly evolution of mass shootings in the USA by year',
    width=1200
)

line_chart_mont_year

Interesting, but it uses far too much space.
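The `year-month` key above is an ad hoc numeric encoding. As a hedged alternative (not what the notebook does), the sketch below builds a real timestamp for each (year, month) pair, which a temporal axis could then order and label on its own. The toy `monthly` table and the `Date` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy table with the same columns as monthly_shootings
monthly = pd.DataFrame({
    'Year': [2020, 2020, 2021],
    'Month': [1, 12, 6],
    'Total Shootings': [30, 40, 70],
})

# pd.to_datetime can assemble dates from columns named year/month/day
parts = monthly[['Year', 'Month']].rename(
    columns={'Year': 'year', 'Month': 'month'}
).assign(day=1)
monthly['Date'] = pd.to_datetime(parts)

print(monthly[['Date', 'Total Shootings']])
```

Such a column could be encoded as `Date:T` instead of an ordinal key, although the chart would still need the same horizontal room.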

In [433]:
mesos = ['January', 'February', 'March', 'April', 'May', 'June', 
         'July', 'August', 'September', 'October', 'November', 'December']
monthly_avg['Mes'] = [mesos[m-1] for m in monthly_avg['Month']]


base = alt.Chart(monthly_avg).encode(
    alt.Theta("Month:O").stack(True),
    alt.Radius("Total Shootings").scale(type="sqrt", zero=True, rangeMin=20),
    alt.Color('Total Shootings:Q', scale=alt.Scale(scheme='reds'))  
)


c1 = base.mark_arc(innerRadius=20, stroke="#fff")


c2 = base.mark_text(
    radiusOffset=18
).encode(
    text="Mes:N",
    color=alt.condition(
        "datum['Total Shootings'] > " + str(5000),
        alt.value("white"),
        alt.value("black"),
    )
)

# Combine charts
polar_area_chart = (c1 + c2).properties(
    width=400,
    height=400,
    title="Monthly average of mass shootings in the USA"
).configure_legend(
    disable=True
)

polar_area_chart

This chart is visually very nice, but it uses too much space for showing only the distribution among months. 
It would be good if we could show this distribution together with the first chart (shootings across years). If we only want to show the average distribution of shootings, we could use a 1x12 heatmap. It uses much less space and still shows the trend clearly.

In [154]:
import pandas as pd
import altair as alt


monthly_avg = monthly_shootings.groupby('Month')['Total Shootings'].mean().reset_index()


months = ['January', 'February', 'March', 'April', 'May', 'June', 
          'July', 'August', 'September', 'October', 'November', 'December']
monthly_avg['Month Name'] = [months[m-1] for m in monthly_avg['Month']]


heatmap = alt.Chart(monthly_avg).mark_rect().encode(
    y=alt.Y('Month Name:N', sort=months, title=None),  # Months placed on the Y axis
    color=alt.Color('Total Shootings:Q',
                    scale=alt.Scale(scheme='lightorange'),legend=None),
    tooltip=[
        alt.Tooltip('Month Name:N', title='Month'),
        alt.Tooltip('Total Shootings:Q', title='Average', format='.1f')
    ]
).properties(
    width=40,
    height=400,
    title={
        "text":['Average monthly','distribution'],
        "dy": 0,
        "fontSize": 16
        }
)

text = alt.Chart(monthly_avg).mark_text(
    baseline='middle',
    color='white'
).encode(
    y=alt.Y('Month Name:N', sort=months),
    text=alt.Text('Total Shootings:Q', format='.1f')
)

heatmap_month_distribution = (heatmap + text)

heatmap_month_distribution


Much better. Let's try to visualize this chart together with the first one:

#### Final visualization 4

In [153]:
# Build the concatenated chart without per-chart configuration
combined_chart = alt.hconcat(
    chart_with_background,
    heatmap_month_distribution
).resolve_scale(
    color='independent'
).configure_axis(
    labelAngle=0,
    labelFontSize=12
).configure_title(
    fontSize=14,
    anchor='middle'
)

combined_chart



<h4 style="color:goldenrod"> Visualization description </h4>

To analyze how shootings in the USA have evolved in recent years, we initially identified two key aspects:

- Showing the yearly trend of shootings.
- Showing the monthly distribution of shootings.
    
For the yearly trend, we decided on a line chart, as it effectively represents temporal data and makes trends clear. Initially, we tried plotting a line for each state, but the result was cluttered with 50+ lines. Instead, we aggregated all states to show the total shootings nationwide. Additionally, we incorporated the governing political party by coloring the chart's background, which was visually clear and used space efficiently.
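As a side note on the colored background, the three rectangle layers share the same shape and differ only in their years and party. A minimal sketch, assuming a single hypothetical `terms` table (the name, the `start`/`end` columns and the `party_in` helper are not from the notebook), shows how the same information could live in one DataFrame:

```python
import pandas as pd

# Hypothetical table: one row per period, using the same years as the rectangles
terms = pd.DataFrame({
    'start': ['2014', '2017', '2021'],
    'end': ['2017', '2021', '2023'],
    # Column name spelled as in the notebook's code
    'governement': ['Democratic', 'Republican', 'Democratic'],
})
terms['start'] = pd.to_datetime(terms['start'], format='%Y')
terms['end'] = pd.to_datetime(terms['end'], format='%Y')


def party_in(year):
    """Return the party whose period [start, end) contains the given year."""
    ts = pd.Timestamp(year=year, month=1, day=1)
    match = terms[(terms['start'] <= ts) & (ts < terms['end'])]
    return match['governement'].iloc[0] if not match.empty else None


print(party_in(2019))
```

A table like this could feed a single `mark_rect` layer with the party encoded as color, instead of three near-identical charts.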

For the monthly trend, the solution wasn’t as straightforward. We first attempted a line chart with total shootings per month and lines for each year, but this created too much clutter. Limiting it to the last five years improved readability, revealing a clear seasonal pattern with higher shootings in summer.
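The averaging behind the seasonal pattern can be sketched on toy data. The dates below are made up for illustration only; the steps mirror the notebook's grouping: count incidents per (year, month), then average those counts per month.

```python
import pandas as pd

# Made-up incident dates, one row per incident (illustration only)
incidents = pd.DataFrame({'Incident Date': [
    '2020-01-05', '2020-07-04', '2020-07-20',
    '2021-01-10', '2021-07-04', '2021-07-15', '2021-07-30',
]})
incidents['Incident Date'] = pd.to_datetime(incidents['Incident Date'])
incidents['Year'] = incidents['Incident Date'].dt.year
incidents['Month'] = incidents['Incident Date'].dt.month

# Step 1: number of incidents in each (year, month) pair
monthly = (incidents.groupby(['Year', 'Month'])
           .size()
           .reset_index(name='Total Shootings'))

# Step 2: average of those counts for each calendar month
monthly_avg = monthly.groupby('Month')['Total Shootings'].mean()

print(monthly_avg)
```

One caveat of this approach: a month with no incidents in a given year produces no row in step 1, so it is left out of the mean rather than counted as zero.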

We also tried a polar chart, which was visually appealing but used too much space for the information conveyed. Finally, we opted for a 1D heatmap, encoding the average shootings per month with color. This approach effectively displayed the monthly distribution while occupying minimal space.

This combination of visualizations allowed us to communicate both annual trends and monthly patterns effectively.

In [None]:
import os 
# os.system('streamlit run ./static_vi.py')