# Visualization Project

---
**Authors**:

- Juan Pablo Zaldivar
- Enric Millán
---

## Introduction

In this second project, the focus is on analyzing collision data in New York City during the summer of 2018. The primary objective is to develop a comprehensive interactive visualization that can address several key questions regarding the nature and patterns of these collisions. With the use of datasets related to collisions, weather conditions, and the New York City map, we aim to explore various facets, including ...

This file contains all the steps required to reproduce the path from raw data to a clean dataset. The project is divided into three parts: the first covers data preprocessing, the second the visualization design process, and the third the implementation of the visualization in the Streamlit app to answer the questions.

The datasets are as follows:

- Collision Dataset: Extracting and filtering collision data specifically from June to September of 2018. This involves selecting relevant columns, handling missing or inconsistent data, and ensuring data quality.

- Weather Dataset: Locating and incorporating weather data corresponding to the time frames and areas of interest.

- New York City Map: Acquiring a suitable map of New York City to overlay geographical information related to collision locations.

### Dataset sources and description

The Collisions dataset (`collisions_2018-2020.csv`) was extracted from the former project. The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The dataset has to be preprocessed again in order to meet the new specifications.

The weather dataset (`weather2018.csv`) was already given by the supervisors of the project. It contains the weather conditions of the city of New York during the summer of 2018.

The map dataset was obtained from this [cartography web page](https://cartographyvectors.com/map/508-new-york-city-boroughs-ny).

The datasets are located in the `Data/` folder. The required libraries are imported below, and each dataset is loaded in its corresponding preprocessing section.

### Libraries

To run the cells correctly, the following folders and all their files are needed:

- `Data/`: Folder containing the datasets.
- `Modules/`: Folder containing the modules used in the project.

In [1]:
# pip install altair==5.1.2 pandas==1.5.3 numpy==1.23.5 h3pandas==0.2.5 geopandas==0.13.2 "vegafusion[embed]>=1.4.0"

In [2]:
import os
import pandas as pd
from Modules import preprocessing as pre

In [3]:
dir = './Data'

## Dataset Preprocessing

The files were preprocessed with a combination of OpenRefine and selected Python libraries, to take advantage of the strengths of each tool. OpenRefine handled the initial cleaning and transformation tasks through its interactive interface, while Python libraries were used for the more complex operations that benefit from their flexibility.

### Collision Dataset Preprocessing

The first step was loading the original dataset into a `Pandas` dataframe, primarily to filter it by date efficiently. The original dataset is large, and OpenRefine processed it slowly, so filtering with `Pandas` first made the selection much quicker. The filtered dataset was then exported as a `.csv` file (`collisions-2018.csv`) and imported into OpenRefine for further processing and cleaning.

In [4]:
collisions = pd.read_csv(dir + '/collisions_2018-2020.csv')
collisions.head(1)

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,2020-09-06,18:05,,,40.771038,-73.83413,"(40.771038, -73.83413)",WHITESTONE EXPRESSWAY,,,...,Following Too Closely,,,,4345591,Station Wagon/Sport Utility Vehicle,Motorcycle,,,


In [5]:
collisions['YEAR'] = pd.DatetimeIndex(collisions['CRASH DATE']).year
collisions.shape

(115740, 30)

In [6]:
collisions = collisions[collisions['YEAR'] == 2018]
collisions.shape

(79383, 30)
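The cells above keep every row from 2018. Since the analysis targets June to September 2018, a narrower date filter could be applied in the same step. The following is a minimal sketch on a toy frame; the helper name `filter_summer_2018` and the sample rows are assumptions made for illustration and are not part of the project's `Modules/preprocessing`.

```python
import pandas as pd

def filter_summer_2018(df, date_col='CRASH DATE'):
    """Keep rows dated from 2018-06-01 to 2018-09-30, both inclusive."""
    dates = pd.to_datetime(df[date_col])
    start = pd.Timestamp('2018-06-01')
    end = pd.Timestamp('2018-09-30')
    return df[dates.between(start, end)]

# Toy rows (invented) around the boundaries of the period.
toy = pd.DataFrame({
    'CRASH DATE': ['2018-05-31', '2018-06-01', '2018-09-30', '2018-10-01', '2020-09-06'],
    'COLLISION_ID': [1, 2, 3, 4, 5],
})
summer = filter_summer_2018(toy)
print(summer['COLLISION_ID'].tolist())
```

The sketch assumes that `CRASH DATE` holds dates without a time component; otherwise the upper bound would exclude collisions later on 30 September.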

#### Data Type Conversion

In OpenRefine, the data conversion process involved several attribute adjustments. The CRASH DATE attribute underwent a conversion to a date type for enhanced consistency and data clarity. Meanwhile, both COLLISION ID and CRASH TIME were temporarily set as string types.

Attributes pertaining to the geographical location of the collisions were modified to strings, accompanied by specific notations. As part of this process, all values were standardized to uppercase, and any extra spaces were removed where applicable. This standardization was implemented to facilitate the effectiveness of the clustering method utilized for collectively inspecting and modifying cells. The objective was to streamline the identification and correction of any inconsistencies or inaccuracies within the data, ensuring a more uniform and reliable dataset for subsequent analyses.

The attributes pertaining to the number of persons involved in the collision underwent a data type conversion to integers within the dataset. This decision was driven by the discrete nature of these values and the expectation that these numerical counts wouldn't contain negative values.

Conversely, the attributes related to vehicles and factors involved in the collisions were retained as strings temporarily. This choice was made to maintain flexibility in handling these attributes during subsequent data processing and analysis phases, ensuring that any necessary modifications or categorizations could be applied effectively as the analysis progressed.

#### Data Selection and Transformation

From previous knowledge of the dataset, a subset of attributes was selected for further analysis. This selection was based on the relevance of the attributes to the research questions and the availability of data.

In [7]:
interest_cols = ['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME', 'OFF STREET NAME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1', 'COLLISION_ID', 'VEHICLE TYPE CODE 1']

collisions = collisions[interest_cols]

**Geographic attributes**

As we saw in the previous project, **ON STREET NAME** and **OFF STREET NAME** seem to be the same attribute under different names. The website of the dataset contains the following descriptions:

- **ON STREET NAME**: *Street on which the collision occurred*.
- **OFF STREET NAME**: *Street address if known*.

This suggests that both attributes hold the same kind of information. Furthermore, no row has both attributes filled, so merging them is plausible and consolidates the information without redundancy.

In [8]:
collisions[(collisions['ON STREET NAME'].notnull()) & (collisions['OFF STREET NAME'].notnull())].shape

(0, 19)

In [9]:
collisions[(collisions['ON STREET NAME'].notnull()) | (collisions['OFF STREET NAME'].notnull())].shape

(79157, 19)

The attribute resulting from merging both columns is called **STREET NAME** and contains the street name or address where the collision occurred. In the 2018 data, 79,157 of the 79,383 rows have one of the two attributes filled, and the subset analyzed later has no missing values in this attribute. Some rows will have a more detailed description of the street, while others will only have the name of the street. The two columns are merged in OpenRefine.
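The merge itself was performed in OpenRefine, so the following is only a pandas sketch of an equivalent operation, under the assumption that at most one of the two columns is filled per row. The toy values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ON STREET NAME': ['WHITESTONE EXPRESSWAY', None, None],
    'OFF STREET NAME': [None, '123 MAIN STREET', None],
})

# Take ON STREET NAME where present, otherwise fall back to OFF STREET NAME.
toy['STREET NAME'] = toy['ON STREET NAME'].fillna(toy['OFF STREET NAME'])

# The two source columns are no longer needed after the merge.
merged = toy.drop(columns=['ON STREET NAME', 'OFF STREET NAME'])
print(merged)
```

Rows where both source columns are empty stay missing in `STREET NAME`.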

The clustering process utilized the key collision method in conjunction with the fingerprint keying function, applied separately to each individual geographical attribute. After a couple of iterations, no substantial alterations in attribute values were identified. However, during this process, misspellings were detected and rectified to ensure data accuracy.

After correcting the misspellings, the clustering was run again with the same results, which means that the values were already consistent. To verify the result, a nearest-neighbour analysis was also run, without finding any significant variation.
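To illustrate the idea behind key collision clustering, the sketch below approximates a fingerprint key in plain Python: values that reduce to the same key are grouped as candidate duplicates. It is an approximation written for this document, not OpenRefine's own code, and the function names and sample street names are assumptions.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a key that ignores case, accents, punctuation and token order."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation, then split on whitespace.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Deduplicate and sort the tokens so word order does not matter.
    return ' '.join(sorted(set(text.split())))

def key_collision_clusters(values):
    """Group distinct spellings that share a fingerprint key."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only keys with more than one spelling are worth reviewing.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

streets = ['BROADWAY', 'Broadway ', 'broadway.', '5 AVENUE', 'AVENUE 5', 'PARK AVENUE']
clusters = key_collision_clusters(streets)
print(clusters)
```

Because token order is ignored, values such as `5 AVENUE` and `AVENUE 5` fall into the same cluster, so each cluster still needs a manual review before merging.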

**Vehicle attributes**

We have already seen that VEHICLE TYPE CODE 1 takes many distinct values. To reduce this complexity, the project statement requires reducing the number of classes with a clustering technique. Clustering condensed the many classes of VEHICLE TYPE CODE 1 into a more manageable set, giving a more concise representation for the subsequent visualization and analysis.

In [10]:
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['Taxi', 'TAXI'], 'TAXI')
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['Fire', 'FD tr', 'firet', 'fire', 'FIRE', 'fd tr', 'FD TR', 'FIRE', 'FIRET'], 'FIRE')
collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['AMBUL', 'Ambulance', 'ambul', 'AMB', 'Ambul', 'AMBULANCE', 'AMBU'], 'AMBULANCE')

The resulting classes are the following:

- **TAXI**
- **FIRE**
- **AMBULANCE**
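The replacements above list every spelling explicitly. An alternative sketch is to normalise case and surrounding whitespace first and then look the value up in a small table, which shortens the lists. The table below is derived from the spellings in the cell above; the names `VARIANTS`, `LOOKUP` and `normalise_vehicle_type` are hypothetical.

```python
# Upper-cased spellings taken from the replacement lists in the cell above.
VARIANTS = {
    'TAXI': {'TAXI'},
    'FIRE': {'FIRE', 'FD TR', 'FIRET'},
    'AMBULANCE': {'AMBUL', 'AMBULANCE', 'AMB', 'AMBU'},
}

# Invert the table: spelling -> class label.
LOOKUP = {variant: label for label, variants in VARIANTS.items() for variant in variants}

def normalise_vehicle_type(value):
    """Map a raw vehicle type to its class; unknown values are returned unchanged."""
    if value is None:
        return None
    key = value.strip().upper()
    return LOOKUP.get(key, value)

print([normalise_vehicle_type(v) for v in ['fd tr', 'Ambul', 'Taxi', 'Sedan']])
```

Unlike the exact-match replacements, this version also catches case or spacing variants that are not listed explicitly, which may or may not be desirable.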

A similar strategy was applied to the CONTRIBUTING FACTOR VEHICLE 1 attribute. However, the aggregation was less exhaustive, since this attribute was not needed a priori for the main questions that the visualizations should answer. For this attribute, basic merge transformations were applied in OpenRefine until no "strange", "uninformative", or repeated classes remained.

**Number of persons attributes**

For visualization purposes, the differentiation between **PERSONS**, **PEDESTRIANS**, **CYCLISTS** and **MOTORISTS** (**INJURED/KILLED**) is irrelevant. A more useful attribute is the total number of persons involved in the collision. This can be obtained by summing the four attributes, under the assumption that the **PERSONS** attribute is not itself the sum of the other three attributes.

This assumption had to be checked because the documentation of the dataset was not precise enough to determine whether **NUMBER OF PERSONS INJURED/KILLED** was an aggregate of the other three columns.

*Note: The metadata available on the dataset's website for **NUMBER OF PERSONS INJURED/KILLED** was: "Number of persons injured/killed".*

In [11]:
collisions['NUMBER OF PERSONS INJURED'].equals(collisions['NUMBER OF PEDESTRIANS INJURED'] + collisions['NUMBER OF CYCLIST INJURED'] + collisions['NUMBER OF MOTORIST INJURED'])

False

In [12]:
collisions['NUMBER OF PERSONS INJURED'].equals(collisions['NUMBER OF PEDESTRIANS INJURED'])

False

In [13]:
collisions['NUMBER OF PERSONS KILLED'].equals(collisions['NUMBER OF PEDESTRIANS KILLED'] + collisions['NUMBER OF CYCLIST KILLED'] + collisions['NUMBER OF MOTORIST KILLED'])

False

In [14]:
collisions['NUMBER OF PERSONS KILLED'].equals(collisions['NUMBER OF PEDESTRIANS KILLED'])

False

According to these checks, **NUMBER OF PERSONS INJURED/KILLED** is not the sum of the other three attributes. Furthermore, persons and pedestrians are not equivalent, although one could have thought that the term persons was used to refer to pedestrians.
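`Series.equals` returns a single boolean for the whole column, and it is also `False` when the two Series differ only in dtype, so it does not show how many rows disagree. A row-level comparison can complement the checks above. This is a minimal sketch on invented toy values, not on the project's data:

```python
import pandas as pd

toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [1, 2, 0, 3],
    'NUMBER OF PEDESTRIANS INJURED': [1, 0, 0, 1],
    'NUMBER OF CYCLIST INJURED': [0, 1, 0, 0],
    'NUMBER OF MOTORIST INJURED': [0, 1, 0, 1],
})

parts = [
    'NUMBER OF PEDESTRIANS INJURED',
    'NUMBER OF CYCLIST INJURED',
    'NUMBER OF MOTORIST INJURED',
]

# Row-wise sum of the three components.
parts_sum = toy[parts].sum(axis=1)

# Boolean mask of the rows where the total differs from that sum.
mismatch = toy['NUMBER OF PERSONS INJURED'] != parts_sum
n_mismatch = int(mismatch.sum())
print(n_mismatch, toy[mismatch].index.tolist())
```

Counting the mismatching rows shows whether the columns disagree everywhere or only in a few records, which a single `False` cannot distinguish.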

Based on this, the discrete attributes referring to injured people were summed to obtain **TOTAL INJURED**, and those referring to killed people were summed to obtain **TOTAL KILLED**. The original **NUMBER OF ... INJURED/KILLED** attributes were then removed.

In [15]:
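# Row-wise sums over the columns whose names contain INJURED / KILLED.
collisions['TOTAL INJURED'] = collisions.filter(regex='INJURED').sum(axis=1)
collisions['TOTAL KILLED'] = collisions.filter(regex='KILLED').sum(axis=1)

# Drop the eight original NUMBER OF ... columns (positions 8 to 15).
collisions = collisions.drop(columns=collisions.columns[8:16])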

---

In [16]:
collisions.to_csv(dir + '/collisions-2018.csv', index=False)

In [17]:
collisions = pd.read_csv(dir + '/collisions-2018_prepro_v1.csv')

In [18]:
collisions = collisions[collisions['VEHICLE TYPE CODE 1'].isin(['FIRE', 'TAXI', 'AMBULANCE'])]

collisions['VEHICLE TYPE CODE 1'].unique()

array(['TAXI', 'AMBULANCE', 'FIRE'], dtype=object)

In [19]:
collisions.shape

(4092, 12)

At this point, the dataset contains the attributes needed for the analysis (except for the weather attributes), plus some extra attributes that were considered interesting for possible additional analyses or insights that could be added to the visualization.

---

#### Missing Values

The existence of some missing values has already been mentioned. In the previous section, missing values were checked with the `.isnull()` method of Pandas. In Pandas, `.isnull()` is an alias of `.isna()`, so both methods flag `NaN` values in the same way. The following cell confirms that the two methods report identical counts.

In [20]:
comp = (collisions.isnull().sum() == collisions.isna().sum())
comp[comp == False]

Series([], dtype: bool)

As shown above, every missing value in the dataset is detected by both `.isnull()` and `.isna()`. As the following counts show, the only attributes with missing values are those describing the geographical properties of the collisions. We can apply a strategy similar to the one used in the previous project to fill them.

In [21]:
collisions.isnull().sum()

COLLISION_ID                        0
CRASH DATE                          0
CRASH TIME                          0
BOROUGH                          1385
ZIP CODE                         1385
LATITUDE                          292
LONGITUDE                         292
STREET NAME                         0
CONTRIBUTING FACTOR VEHICLE 1       0
VEHICLE TYPE CODE 1                 0
TOTAL INJURED                       0
TOTAL KILLED                        0
dtype: int64

In [22]:
collisions['STREET NAME'] = collisions['STREET NAME'].apply(pre.capitalize_street)

We first address the missing coordinates, using the information in the **STREET NAME** column, which has no missing values. The process is as follows:

The `Nominatim` geocoding service retrieves the coordinates corresponding to each street name. Whenever available, the **BOROUGH** or **ZIP CODE** values are used to refine the search. If the search succeeds, the missing values are filled with the coordinates obtained; if it yields no results, the missing values are left as they are. This is done by the `pre.fill_missing_coordinates()` function.
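The implementation of `pre.fill_missing_coordinates()` is not shown here. The sketch below only illustrates the logic described above, with the geocoder passed in as a plain function so that the example runs offline. The names `fill_coordinates_sketch` and `fake_geocode`, the query format and the coordinates are assumptions; in practice `geocode` could be a thin wrapper around a Nominatim client.

```python
import math

def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fill_coordinates_sketch(row, geocode):
    """Return a copy of `row` with LATITUDE/LONGITUDE filled when missing.

    `geocode` takes a query string and returns (lat, lon) or None.
    """
    filled = dict(row)
    if not (is_missing(row.get('LATITUDE')) or is_missing(row.get('LONGITUDE'))):
        return filled  # coordinates already present

    # Refine the query with BOROUGH and ZIP CODE when they are available.
    parts = [row['STREET NAME']]
    for col in ('BOROUGH', 'ZIP CODE'):
        if not is_missing(row.get(col)):
            parts.append(str(row[col]))
    parts.append('New York City')

    result = geocode(', '.join(parts))
    if result is not None:
        filled['LATITUDE'], filled['LONGITUDE'] = result
    return filled  # unchanged when the search yields nothing

def fake_geocode(query):
    """Offline stand-in for a geocoding service (hypothetical values)."""
    known = {'Broadway, MANHATTAN, New York City': (40.7590, -73.9845)}
    return known.get(query)

row = {'STREET NAME': 'Broadway', 'BOROUGH': 'MANHATTAN', 'ZIP CODE': None,
       'LATITUDE': None, 'LONGITUDE': None}
print(fill_coordinates_sketch(row, fake_geocode))
```

Injecting the geocoder keeps the filling logic testable without network access and makes it easy to add caching or rate limiting around the real service.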

In [23]:
if not os.path.exists(dir + '/collisions-2018_prepro_v2.csv'):
    collisions = collisions.apply(pre.fill_missing_coordinates, axis=1)
    collisions.to_csv(dir + '/collisions-2018_prepro_v2.csv', index=False)
else:
    collisions = pd.read_csv(dir + '/collisions-2018_prepro_v2.csv')

In [24]:
collisions.isnull().sum()

COLLISION_ID                        0
CRASH DATE                          0
CRASH TIME                          0
BOROUGH                          1385
ZIP CODE                         1385
LATITUDE                           34
LONGITUDE                          34
STREET NAME                         0
CONTRIBUTING FACTOR VEHICLE 1       0
VEHICLE TYPE CODE 1                 0
TOTAL INJURED                       0
TOTAL KILLED                        0
dtype: int64

In [25]:
34*100/collisions.shape[0]

0.83088954056696

After geocoding, only $0.83\%$ of the rows still lack coordinates. To fill the missing entries in the **BOROUGH** and **ZIP CODE** columns, we check whether the point defined by a row's coordinates (**LONGITUDE**, **LATITUDE**) lies inside the boundary polygon of each **BOROUGH** or **ZIP CODE**. When the original attribute value is null, it is assigned the borough or ZIP code whose polygon contains the point.

For that, the extraction of polygons for both the **BOROUGH** and **ZIP CODE** attributes is imperative. The polygons for the **BOROUGH** attribute were acquired from the existing files accessible within the `./Data` folder.
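The point-in-polygon test behind this step can be sketched in plain Python with the ray casting rule. This is an illustration only: the project's `pre.fill_missing_borough_zip` may rely on a geometry library instead, and the function names and the two square "regions" below are invented.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: toggle `inside` at each edge crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line y = lat can be crossed.
        if (y1 > lat) != (y2 > lat):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_region(lon, lat, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None

# Two adjacent unit squares standing in for borough polygons.
regions = {
    'WEST': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'EAST': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(assign_region(0.5, 0.5, regions), assign_region(1.5, 0.5, regions), assign_region(3, 3, regions))
```

Points that fall outside every polygon return `None`, which corresponds to the rows that remain without a borough or ZIP code after this step.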

In [26]:
borough_poly = pre.get_borough_polygons()

In [27]:
zip_poly = pre.get_zip_polygons()

In [28]:
pre.fill_missing_borough_zip(collisions, borough_poly, zip_poly)



In [29]:
collisions['BOROUGH'].value_counts()

BOROUGH
MANHATTAN        2365
BROOKLYN          617
QUEENS            568
BRONX             470
STATEN ISLAND      12
Name: count, dtype: int64

After imputation, Manhattan has the highest count (2,365), followed by Brooklyn (617), Queens (568), the Bronx (470), and finally Staten Island (12). These counts include both the original and the imputed **BOROUGH** values, so they describe how the collisions in the subset are distributed across boroughs rather than how many values were imputed in each one.

In [30]:
collisions.isnull().sum()

COLLISION_ID                      0
CRASH DATE                        0
CRASH TIME                        0
BOROUGH                          60
ZIP CODE                         77
LATITUDE                         34
LONGITUDE                        34
STREET NAME                       0
CONTRIBUTING FACTOR VEHICLE 1     0
VEHICLE TYPE CODE 1               0
TOTAL INJURED                     0
TOTAL KILLED                      0
dtype: int64

The missing values in **BOROUGH** and **ZIP CODE** drop from 1,385 each to 60 and 77, respectively. Rows that still contain a missing value represent a small proportion of the dataset, so they can be removed without significantly affecting the analysis.

In [31]:
collisions.isnull().any(axis=1).sum()

89

In [32]:
collisions = collisions.dropna()
collisions['BOROUGH'] = collisions['BOROUGH'].apply(pre.capitalize_boroughs)

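Finally, we derive some extra variables that provide different aggregation levels for time in the visualization: **HOUR**, **WEEKDAY** and **MONTH**. The cell first stores the month number in **MONTH** and then overwrites it with the month name, so only the name is kept. Rows with a **LONGITUDE** of 0 are also discarded.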

In [33]:

collisions['HOUR'] = pd.to_datetime(collisions['CRASH TIME']).dt.hour
collisions['MONTH'] = pd.to_datetime(collisions['CRASH DATE']).dt.month

collisions = collisions[collisions['LONGITUDE'] != 0]

collisions['WEEKDAY'] = pd.to_datetime(collisions['CRASH DATE']).dt.day_name()
collisions['MONTH'] = pd.to_datetime(collisions['CRASH DATE']).dt.month_name()

collisions['VEHICLE TYPE CODE 1'] = collisions['VEHICLE TYPE CODE 1'].replace(['FIRE', 'TAXI', 'AMBULANCE'], ['Fire', 'Taxi', 'Ambulance'])
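In the cell above, **MONTH** is assigned twice, so the month number is overwritten by the month name. If both were needed, they could be stored in separate columns. The sketch below uses invented toy rows and a hypothetical `MONTH_NUM` column; it also passes an explicit `format` when parsing the time, which avoids relying on format inference.

```python
import pandas as pd

toy = pd.DataFrame({
    'CRASH DATE': ['2018-06-01', '2018-09-30'],
    'CRASH TIME': ['18:05', '00:30'],
})

# Parse each column once and reuse the result.
dates = pd.to_datetime(toy['CRASH DATE'])
times = pd.to_datetime(toy['CRASH TIME'], format='%H:%M')

toy['HOUR'] = times.dt.hour
toy['MONTH_NUM'] = dates.dt.month      # hypothetical column for the month number
toy['MONTH'] = dates.dt.month_name()   # month name, as in the cell above
toy['WEEKDAY'] = dates.dt.day_name()
print(toy)
```

Keeping the numeric month alongside the name makes it straightforward to sort months chronologically in a chart.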


In [34]:
collisions.to_csv(f'{dir}/collisions_clean.csv', index=False)

---

### Weather Dataset Preprocessing

In [35]:
weather = pd.read_csv(f'{dir}/weather2018.csv')
pd.set_option('display.max_columns', None)
weather.head()

Unnamed: 0,name,datetime,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,precipprob,precipcover,preciptype,snow,snowdepth,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,sunrise,sunset,moonphase,conditions,description,icon,stations
0,new york,2018-06-01,26.7,17.9,21.6,26.7,17.9,21.6,19.2,86.8,0.282,100,16.67,rain,0,0,83.5,12.6,36.9,1009.6,65.9,11.3,194.0,16.5,9,,2018-06-01T05:27:08,2018-06-01T20:21:01,0.6,"Rain, Partially cloudy",Partly cloudy throughout the day with rain in ...,rain,"72505394728,72055399999,KLGA,KJRB,KNYC,F1417,7..."
1,new york,2018-06-02,30.6,20.7,25.1,32.9,20.7,25.9,19.8,74.0,0.346,100,8.33,rain,0,0,55.0,22.3,31.0,1007.6,35.4,15.8,240.4,20.8,9,,2018-06-02T05:26:43,2018-06-02T20:21:44,0.63,"Rain, Partially cloudy",Partly cloudy throughout the day with rain.,rain,"72505394728,72055399999,KLGA,KJRB,KNYC,F1417,7..."
2,new york,2018-06-03,21.7,12.8,17.0,21.7,12.8,17.0,12.5,75.0,2.929,100,12.5,rain,0,0,35.3,24.1,96.3,1015.8,92.7,15.6,102.2,8.9,3,,2018-06-03T05:26:20,2018-06-03T20:22:26,0.67,"Rain, Overcast",Cloudy skies throughout the day with rain.,rain,"72505394728,72055399999,KLGA,KJRB,KNYC,F1417,7..."
3,new york,2018-06-04,23.7,11.7,16.8,23.7,11.7,16.8,12.4,76.6,223.796,100,41.67,rain,0,0,29.5,16.7,48.8,1009.2,71.6,15.4,226.0,19.4,8,,2018-06-04T05:25:59,2018-06-04T20:23:06,0.7,"Rain, Partially cloudy",Partly cloudy throughout the day with rain cle...,rain,"72505394728,72055399999,KLGA,KJRB,KNYC,F1417,7..."
4,new york,2018-06-05,24.2,16.1,19.8,24.2,16.1,19.8,11.7,60.7,0.0,0,0.0,,0,0,37.1,25.9,283.1,1005.6,35.7,16.0,199.7,17.3,9,,2018-06-05T05:25:40,2018-06-05T20:23:45,0.73,Partially cloudy,Partly cloudy throughout the day.,partly-cloudy-day,"72505394728,72055399999,KLGA,KJRB,KNYC,F1417"


In [36]:
weather.shape

(122, 33)

In [37]:
weather.describe()

Unnamed: 0,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,precipprob,precipcover,snow,snowdepth,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,moonphase
count,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,93.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,122.0,0.0,122.0
mean,28.092623,20.943443,24.152459,29.482787,20.982787,24.762295,18.202459,71.332787,6.37273,61.47541,12.807541,0.0,0.0,42.722581,19.313115,170.00082,1017.756557,45.837705,15.398361,197.408197,17.027049,7.04918,,0.483443
std,3.994318,3.134558,3.273566,5.509071,3.221562,4.025217,3.797247,11.060401,29.415435,48.866019,17.46702,0.0,0.0,14.73081,6.794662,98.724638,5.09336,33.667473,1.071338,84.628818,7.304544,2.535215,,0.286131
min,18.7,11.7,16.8,18.7,11.7,16.8,10.3,46.0,0.0,0.0,0.0,0.0,0.0,25.9,8.8,15.2,1005.6,0.2,11.0,19.5,1.6,1.0,,0.0
25%,25.075,18.9,22.075,25.075,18.9,22.075,15.1,63.325,0.0,0.0,0.0,0.0,0.0,33.5,13.8,65.425,1014.125,15.0,15.325,138.425,11.875,6.0,,0.25
50%,28.7,21.1,24.3,28.95,21.1,24.4,18.85,72.8,0.2415,100.0,8.33,0.0,0.0,38.9,18.4,180.15,1018.1,41.9,15.9,209.45,18.05,8.0,,0.5
75%,31.175,23.3,26.45,33.8,23.3,27.375,21.575,79.15,2.255,100.0,16.67,0.0,0.0,44.6,24.1,253.275,1021.475,75.075,16.0,262.775,22.775,9.0,,0.72
max,35.6,26.7,30.6,41.4,29.5,34.6,23.6,93.4,231.468,100.0,100.0,0.0,0.0,86.8,40.7,350.1,1030.2,100.0,16.0,331.3,28.6,10.0,,0.97


In [38]:
weather.isnull().sum()

name                  0
datetime              0
tempmax               0
tempmin               0
temp                  0
feelslikemax          0
feelslikemin          0
feelslike             0
dew                   0
humidity              0
precip                0
precipprob            0
precipcover           0
preciptype           47
snow                  0
snowdepth             0
windgust             29
windspeed             0
winddir               0
sealevelpressure      0
cloudcover            0
visibility            0
solarradiation        0
solarenergy           0
uvindex               0
severerisk          122
sunrise               0
sunset                0
moonphase             0
conditions            0
description           0
icon                  0
stations              0
dtype: int64

In [39]:
weather['icon'].unique()

array(['rain', 'partly-cloudy-day', 'clear-day', 'cloudy'], dtype=object)

For design purposes, we decided to change the values of the **icon** variable as follows:

In [40]:
weather['icon'] = weather['icon'].replace({
    'rain': 'Rainy',
    'clear-day': 'Clear',
    'partly-cloudy-day': 'Partly cloudy',
    'cloudy': 'Cloudy',
})

Next, we select the columns needed to answer the proposed questions, plus some extra variables that may be useful for an additional visualization:

In [41]:
weather = weather[['datetime', 'temp', 'icon', 'tempmin', 'tempmax']]

weather.columns = ['datetime', 'temp', 'ICON', 'tempmin', 'tempmax']

weather.to_csv(f'{dir}/weather_clean.csv', index=False)

## Visualization Design