Steps to collect and prepare data for visualizations:
1. Identify the data sources: databases, spreadsheets, APIs, and other sources
2. Collect the data: manually, or through web scraping, API requests, or data extraction tools
3. Clean the data: filling in missing data, removing duplicates, and fixing inconsistencies
4. Organize and format data: structuring the data into tables, formatting the data in a standardized way, and creating data dictionaries
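As a small illustration of the "fixing inconsistencies" part of step 3 and the "standardized way" part of step 4, the sketch below normalizes text categories and converts numeric text to numbers. The rows in `raw` are invented for the example (they are not taken from weather.csv); only the column names mirror the dataset used below.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration.
raw = pd.DataFrame({
    "WindGustDir": [" nw", "NW ", "ene", "NW"],
    "RainToday": ["Yes", "no", "YES", "No"],
    "MaxTemp": ["24.3", "26.9", "n/a", "24.3"],
})

clean = raw.copy()

# Trim stray whitespace and use one letter case for categorical text.
clean["WindGustDir"] = clean["WindGustDir"].str.strip().str.upper()
clean["RainToday"] = clean["RainToday"].str.strip().str.capitalize()

# Convert numeric text to numbers; entries that cannot be parsed become NaN.
clean["MaxTemp"] = pd.to_numeric(clean["MaxTemp"], errors="coerce")

print(clean)
```

Converting with `errors="coerce"` turns unparseable entries into missing values, so they can be handled in the same pass as the other gaps in the data.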

In [19]:
import pandas as pd

# CSV source: https://www.kaggle.com/datasets/zaraavagyan/weathercsv?resource=download

# Collect the data
df = pd.read_csv("weather.csv")

In [20]:
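# Explore the data

print(df.head())          # first five rows
df.info()                 # column dtypes and non-null counts (prints directly, returns None)
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # number of missing values per column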

   MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0      8.0     24.3       0.0          3.4       6.3          NW   
1     14.0     26.9       3.6          4.4       9.7         ENE   
2     13.7     23.4       3.6          5.8       3.3          NW   
3     13.3     15.5      39.8          7.2       9.1          NW   
4      7.6     16.1       2.8          5.6      10.6         SSE   

   WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  ...  Humidity3pm  \
0           30.0         SW         NW           6.0  ...           29   
1           39.0          E          W           4.0  ...           36   
2           85.0          N        NNE           6.0  ...           69   
3           54.0        WNW          W          30.0  ...           56   
4           50.0        SSE        ESE          20.0  ...           49   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1019.7       1015.0         7         7     14.4     23.6 

In [21]:
# Clean the data

df = df.dropna()           # drop every row that contains at least one missing value
df = df.drop_duplicates()  # drop rows that are exact duplicates of an earlier row
print(df)

     MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0        8.0     24.3       0.0          3.4       6.3          NW   
1       14.0     26.9       3.6          4.4       9.7         ENE   
2       13.7     23.4       3.6          5.8       3.3          NW   
3       13.3     15.5      39.8          7.2       9.1          NW   
4        7.6     16.1       2.8          5.6      10.6         SSE   
..       ...      ...       ...          ...       ...         ...   
361      9.0     30.7       0.0          7.6      12.1         NNW   
362      7.1     28.4       0.0         11.6      12.7           N   
363     12.5     19.9       0.0          8.4       5.3         ESE   
364     12.5     26.9       0.0          5.0       7.1          NW   
365     12.3     30.2       0.0          6.0      12.6          NW   

     WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  ...  Humidity3pm  \
0             30.0         SW         NW           6.0  ...           29   
1      
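The cell above handles missing data by dropping rows. Step 3 also mentions filling in missing data, which keeps the rows. A minimal sketch of that alternative follows, assuming the median is an acceptable fill for numeric columns and the most frequent value for text columns. The `sample` frame is invented for illustration; with the real data the same lines could be applied to `df` in place of `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the values are invented for illustration.
sample = pd.DataFrame({
    "Sunshine": [6.3, np.nan, 3.3, 9.1],
    "WindGustDir": ["NW", "ENE", None, "NW"],
})

filled = sample.copy()

# Numeric columns: replace missing values with the column median.
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Remaining (text) columns: replace missing values with the most frequent value.
for col in filled.columns.difference(numeric_cols):
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
print(filled.isnull().sum())
```

Whether to drop or fill depends on how many rows are affected and on how sensitive the planned visualization is to imputed values.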

In [22]:
# Organize the data

df_sorted = df.sort_values('MinTemp', ascending=False) # sort rows by MinTemp, highest first
print(df_sorted)

     MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
72      20.9     35.7       0.0         13.8       6.9          SW   
51      19.9     22.0      11.0          4.4       5.9         NNW   
95      18.2     22.6       1.8          8.0       0.0         ENE   
90      18.0     34.9       0.0          9.2       9.9          NW   
76      17.9     33.2       0.0         10.4       8.4           N   
..       ...      ...       ...          ...       ...         ...   
283     -3.5      7.6       0.4          2.4       4.7          NW   
265     -3.5     11.2       0.0          1.6       7.7         ESE   
297     -3.7     14.4       0.0          2.6      10.4         NNW   
313     -3.7     14.7       0.0          3.4      10.9         SSE   
292     -5.3     13.1       0.0          2.2       7.9          NW   

     WindGustSpeed WindDir9am WindDir3pm  WindSpeed9am  ...  Humidity3pm  \
72            50.0          E        WNW           4.0  ...           28   
51     
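Step 4 also mentions creating data dictionaries, which the cells above do not show. One possible way to build a simple one with pandas is sketched below. The `sample` frame and the text in `descriptions` are assumptions made for illustration (including the units), so they should be checked against the dataset's own documentation before use.

```python
import pandas as pd

# Hypothetical sample with a few of the weather columns; values are illustrative.
sample = pd.DataFrame({
    "MinTemp": [8.0, 14.0, 13.7],
    "Rainfall": [0.0, 3.6, 3.6],
    "WindGustDir": ["NW", "ENE", "NW"],
})

# Assumed descriptions and units; verify against the dataset documentation.
descriptions = {
    "MinTemp": "Minimum temperature for the day (assumed degrees Celsius)",
    "Rainfall": "Rainfall recorded for the day (assumed millimetres)",
    "WindGustDir": "Compass direction of the strongest wind gust",
}

# One row per column: name, dtype, non-null count, distinct values, description.
data_dictionary = pd.DataFrame({
    "column": sample.columns,
    "dtype": sample.dtypes.astype(str).values,
    "non_null": sample.notna().sum().values,
    "unique": sample.nunique().values,
    "description": [descriptions.get(c, "") for c in sample.columns],
})

print(data_dictionary)
```

Saving the result next to the data, for example with `data_dictionary.to_csv(...)`, keeps the column meanings available to whoever builds the visualizations.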