<h1 style="text-align:center;">Mass Shootings in the USA</h1>

<p style="text-align:center;">Laia Mogas and Pau Mateo</p>
<p style="text-align:center;">Information Visualization - GCED</p>

## Obtaining data
The main data sources we used in this project are:
- https://www2.census.gov/
- https://www.gunviolencearchive.org/reports

From these sources we obtained the following datasets:
- Mass shootings (January 2014 to September 2024)
- County population


## Preprocessing and data cleaning

To clean and preprocess our data we used OpenRefine and Python.
#### Concatenating the mass shootings CSV files
The raw data from the Gun Violence Archive is split into several files. To combine them into a single CSV file, we opened all the CSV files together when creating a new OpenRefine project; because they all have the same columns, OpenRefine concatenates them automatically. Before doing that, we had to remove some duplicate rows, as the file `MassShootings_2021.csv` overlaps with the file `GunVioleceAllYears.csv`. We again used OpenRefine for this: we created a timeline facet on `MassShootings_2021.csv`, selected the rows that were already present in `GunVioleceAllYears.csv`, and removed them.
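For readers who prefer to script this step, the following is a minimal pandas sketch of the same idea. It is not the procedure used in the project (that was done in OpenRefine with a facet on the date); it drops overlapping rows by a key column instead. The helper name `concat_without_overlap`, the `Incident ID` column and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def concat_without_overlap(frames, key_cols):
    """Stack frames that share the same columns and keep one copy of each key."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset=key_cols).reset_index(drop=True)


# Toy stand-ins for the two overlapping files (values are made up)
all_years = pd.DataFrame({
    "Incident ID": [1, 2, 3],
    "Incident Date": ["January 1, 2020", "March 5, 2021", "April 2, 2021"],
})
year_2021 = pd.DataFrame({
    "Incident ID": [2, 3, 4],
    "Incident Date": ["March 5, 2021", "April 2, 2021", "May 9, 2021"],
})

merged = concat_without_overlap([all_years, year_2021], key_cols=["Incident ID"])
print(merged)
```

Deduplicating on a key column only works if the overlapping rows carry the same identifier in both files; otherwise a date-range filter like the one described above is the safer choice.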

#### Transformations
We applied some basic transformations to the dataset, mainly changing the types of some columns: *Incident Date* to date, and the numerical columns to integer.
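The same type conversions could be done in pandas roughly as follows. This is only a sketch: the column names `Victims Killed` and `Victims Injured`, the date format and the toy values are assumptions, not taken from the project files.

```python
import pandas as pd

# Toy rows where every column is still text, as after a raw CSV import
raw = pd.DataFrame({
    "Incident Date": ["January 1, 2020", "March 5, 2021"],
    "Victims Killed": ["4", "0"],
    "Victims Injured": ["2", "5"],
})

typed = raw.copy()
# Parse the date column (assumed format: "Month day, year")
typed["Incident Date"] = pd.to_datetime(typed["Incident Date"], format="%B %d, %Y")
# Convert the numerical columns to integers
for col in ["Victims Killed", "Victims Injured"]:
    typed[col] = pd.to_numeric(typed[col]).astype("int64")

print(typed.dtypes)
```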

#### Obtaining counties with OSM
To answer the project's questions accurately, we needed the exact county where each incident occurred. To obtain it, we first used OpenRefine to fetch JSON data from OpenStreetMap, with the "Add column by fetching URLs" method. The expression we used is:

In [None]:
'https://nominatim.openstreetmap.org/search?format=json&email=pau.mateo.bernado@estudiantat.upc.edu&app=google-refine&q=' + 
escape(cells["City Or County"].value + ", " + cells["State"].value, 'url') + '&limit=1&addressdetails=1'

With this, we obtained some json information for each row, for example:

In [None]:
[{"place_id":323075484,
"licence":"Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type":"relation",
"osm_id":1180533,
"lat":"38.6280278",
"lon":"-90.1910154",
"class":"boundary",
"type":"administrative",
"place_rank":12,
"importance":0.7078550940195126,
"addresstype":"independent_city",
"name":"Saint Louis",
"display_name":"Saint Louis, Missouri, United States",
"boundingbox":["38.5323215","38.7743018","-90.3206525","-90.1641941"]}]

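For reference, a request URL of the same shape could also be assembled in Python with the standard library. This is a hedged sketch, not what the project did: the helper name `build_nominatim_url` is hypothetical, the email address is a placeholder, and only the query parameters that Nominatim itself reads are included.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_nominatim_url(city_or_county: str, state: str, email: str) -> str:
    """Build a search URL asking for one JSON result with address details."""
    params = {
        "format": "json",
        "email": email,  # contact address, placeholder in this sketch
        "q": f"{city_or_county}, {state}",
        "limit": 1,
        "addressdetails": 1,
    }
    # urlencode percent-encodes the values (spaces become '+')
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


url = build_nominatim_url("Saint Louis", "Missouri", "user@example.com")
print(url)
```

Each response to such a request is a JSON list like the one shown above.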
From this information we wanted to obtain the county for each incident. However, we could not do so within OpenRefine, as the responses did not give the county names in the same format as the dataset from census.gov, so we had to process this information further with Python.
We defined some functions that try to find, for each incident, a county name that exists in the population dataset obtained from census.gov. The functions iterate through the names in the `"display_name"` key of the JSON column (in some cases the township or other information is also present, so the county is not always in the same position), apply some transformations to each name, and check whether the result is a county.

In [None]:
import json
from collections import defaultdict

import pandas as pd

pop = pd.read_csv("datasets/CountyPopulationAllYears")

# Set of county names for each state, used for the membership tests below
names_counties = defaultdict(set)

for state in pop['STNAME'].unique():
    names_counties[state] = set(pop[pop['STNAME'] == state]['CTYNAME'].values)


def transforms(name: str) -> str:
    '''Applies hand-tuned fixes for names that differ from the census names.'''
    if 'Clarke' in name: return 'Clarke County'
    if 'Washington' in name: return 'District of Columbia'
    name = name.replace('Saint', 'St.').replace('Nashville-', '').replace('Vista', 'Buena Vista').replace('Compton', 'Los Angeles County')
    name = name.replace('Monterey Park', 'Monterey County')
    return name.strip()


def try_find(name: str, state: str) -> str | None:
    '''Tries to find the county corresponding to `name` by
    applying transformations and trying different suffixes.'''
    if name in names_counties[state]:
        return name

    base = transforms(name)
    if base in names_counties[state]:
        return base

    # Try each county-equivalent suffix used in the census names
    for suffix in ('County', 'Parish', 'city', 'Municipality'):
        candidate = base if suffix in base else f'{base} {suffix}'
        if candidate in names_counties[state]:
            return candidate

    return None


def extract_county(row) -> str | None:
    '''Tries to extract the corresponding county for the given row.'''
    state = row['State'].strip()

    places = json.loads(row['json'])
    if not places:
        return None

    display_name = places[0].get("display_name", "")
    # Look for "County", "Parish" or equivalent entities in display_name:
    # commas and parentheses are treated as separators between candidates
    for delim in (",", ")", "("):
        display_name = display_name.replace(delim, "_")
    names = [n.strip() for n in display_name.split("_") if n.strip()]

    # Return the first piece that matches a census county name
    for n in names:
        cname = try_find(n, state)
        if cname:
            return cname

    return None
    

#usage
MassShootingsComplete
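As a self-contained illustration of how such a row-wise lookup can be applied to a DataFrame, here is a simplified sketch. The function `toy_match`, the tiny `toy_counties` table and the sample rows are stand-ins invented for this example; they only mimic the matching idea of the functions above and do not reproduce their special cases.

```python
import json

import pandas as pd

# Toy stand-in for the census county names per state (tiny, illustrative subset)
toy_counties = {
    "Missouri": {"St. Louis city"},
    "Louisiana": {"Orleans Parish"},
}


def toy_match(row):
    """Simplified stand-in for the county lookup: first piece that matches wins."""
    places = json.loads(row["json"])
    if not places:
        return None
    known = toy_counties.get(row["State"], set())
    for piece in places[0].get("display_name", "").split(","):
        piece = piece.strip().replace("Saint", "St.")
        for suffix in ("", " County", " Parish", " city"):
            if piece + suffix in known:
                return piece + suffix
    return None


incidents = pd.DataFrame({
    "State": ["Missouri", "Louisiana", "Missouri"],
    "json": [
        '[{"display_name": "Saint Louis, Missouri, United States"}]',
        '[{"display_name": "New Orleans, Orleans Parish, Louisiana, United States"}]',
        "[]",  # the geocoder returned no result for this row
    ],
})

# axis=1 passes each row to the function, one at a time
incidents["county"] = incidents.apply(toy_match, axis=1)
print(incidents[["State", "county"]])
```

Rows for which no candidate matches keep a missing value in the new column, so they can be inspected and fixed by hand afterwards.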

### Obtaining FIPS codes

In [None]:
def get_fips(row) -> str | None:
    """
    Returns the FIPS code for the given row by looking
    it up in the population csv
    """
    # Check that 'county' is a string before calling strip()
    county_value = row['county']
    if isinstance(county_value, str):
        county_value = county_value.strip()  # remove surrounding whitespace
    else:
        county_value = ""  # missing county: use an empty string so that nothing matches

    # Filter pop to get the row with the same combination of 'State' and 'county'
    fips_value = pop[(pop['STNAME'] == row['State']) & (pop['CTYNAME'] == county_value)]['FIPS']

    # Return the first value (it should be unique), zero-padded to five digits
    # (assumes the 'FIPS' column holds the combined state + county code)
    return f'{int(fips_value.iloc[0]):05d}' if not fips_value.empty else None
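An alternative to the row-by-row lookup is a single left join on state and county. The sketch below is an assumption about how that could look, not the approach used in the project; `pop_toy` and `incidents` are toy tables built for the example, and the `FIPS` column is assumed to hold the combined state and county code as an integer.

```python
import pandas as pd

# Toy population table in the shape assumed above
pop_toy = pd.DataFrame({
    "STNAME": ["Missouri", "Alabama"],
    "CTYNAME": ["St. Louis city", "Autauga County"],
    "FIPS": [29510, 1001],
})
incidents = pd.DataFrame({
    "State": ["Missouri", "Alabama", "Alabama"],
    "county": ["St. Louis city", " Autauga County ", None],
})

# Zero-pad the codes to five characters so leading zeros are not lost
lookup = pop_toy.assign(FIPS=pop_toy["FIPS"].astype(str).str.zfill(5))
# Strip whitespace from the county names before joining
keyed = incidents.assign(county=incidents["county"].str.strip())

# Left join: incidents without a matching county keep a missing FIPS value
with_fips = keyed.merge(
    lookup,
    how="left",
    left_on=["State", "county"],
    right_on=["STNAME", "CTYNAME"],
)
print(with_fips[["State", "county", "FIPS"]])
```

A join avoids filtering the population table once per incident, which matters when the incident table has thousands of rows.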

## Questions

### **Q1:** Which states have a large number of mass shootings per citizen?

### **Q2:** How is the number of mass shootings per citizen distributed across the different counties in the US? And across states?

### **Q3:** Are the mass shootings correlated with gun violence incidents in schools?

### **Q4:** How have mass shootings evolved over the last years in the US?