# Exploring Sao Paulo temperature records

Let's take a look at Sao Paulo temperature records. The city expanded a lot, with a sharp increase in cars and industry in the late 19th century. Did it get much hotter over time? Let's check it out!

## Importing dependencies

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

## Loading temperature data

In [None]:
df_temperature_records = pd.read_csv("../input/temperature-timeseries-for-some-brazilian-cities/station_sao_paulo.csv")

df_temperature_records.head(10)

## Data cleaning

The first thing I noticed is that we can use `YEAR` as the index of the DataFrame.

In [None]:
df_temperature_records.set_index("YEAR", inplace=True)

df_temperature_records

It looks like missing data are marked with the value `999.90`. We have to clean them up. Let's do a simple workaround and replace these values with NaN.

In [None]:
df_temperature_records.replace(999.90, np.nan, inplace=True)

df_temperature_records
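As an alternative to replacing the sentinel after loading, the marker could be declared at read time through the `na_values` parameter of `pd.read_csv`. The sketch below is only an illustration: it uses a tiny made-up CSV (not the real station file), assumes the marker is written as the text `999.90`, and the names `sample_csv` and `df_sample` are hypothetical.

```python
import io

import pandas as pd

# Made-up sample in a similar layout to the station file (values are not real data)
sample_csv = io.StringIO(
    "YEAR,JAN,FEB,metANN\n"
    "2001,22.5,999.90,19.8\n"
    "2002,999.90,23.1,999.90\n"
)

# Treat the sentinel as missing while parsing and use YEAR as the index
df_sample = pd.read_csv(sample_csv, index_col="YEAR", na_values=["999.90", "999.9"])

# Count the missing values per column
missing_per_column = df_sample.isna().sum()
print(missing_per_column)
```

Counting the NaN values per column right after loading is a quick way to see how much of the record is actually missing.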

## Data splitting


After taking a look at the dataset, we see that, for a given year, we have a temperature for each month and a mean temperature for the whole year. So, we can split it into two datasets:

1. A dataset for temperature records by month for each year;

2. A dataset for mean temperature for each year.

Enough talking, let's do it.

In [None]:
columns_for_mean_records = ["metANN"]
df_temperature_records_months = df_temperature_records.loc[:, :"DEC"]  # slicing up to "DEC"
df_temperature_records_mean = df_temperature_records[columns_for_mean_records]

In [None]:
df_temperature_records_months

In [None]:
df_temperature_records_mean

## Plotting temperature records

### Mean temperature

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_temperature_records_mean, legend=False)
plt.ylabel("Mean temperature (degC)")
plt.show()

Please note that NaN values are simply skipped in the plots, so we don't need to handle them **in this case**.

### Temperature records at a given year

Let's have a look at the temperatures for 2015.

In [None]:
df_selected_temperature_year = df_temperature_records_months[df_temperature_records_months.index == 2015]

df_selected_temperature_year

Well, not good for plotting, but the transposed DataFrame would be great.

In [None]:
df_selected_temperature_year = df_selected_temperature_year.T

df_selected_temperature_year
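As a side note, selecting a single year with a scalar label in `.loc` returns a Series indexed by month, which avoids the transpose step. This is a minimal sketch on made-up values; `df_months_demo` and `series_2015` are hypothetical names standing in for the notebook's DataFrames.

```python
import pandas as pd

# Made-up monthly values for two years (not real station data)
df_months_demo = pd.DataFrame(
    {"JAN": [22.1, 23.0], "FEB": [22.4, 23.3], "MAR": [21.7, 22.2]},
    index=pd.Index([2015, 2016], name="YEAR"),
)

# A scalar label returns a Series indexed by month, so no transpose is needed
series_2015 = df_months_demo.loc[2015]
print(series_2015)
```

A Series like this can be plotted directly, for example with `series_2015.plot()`.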

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_selected_temperature_year, legend=False)
plt.ylabel("Temperature (degC)")
plt.title(u"São Paulo temperature records - 2015")
plt.show()

Not a beautiful plot, though. Would 2016 be similar to 2015? Let's see.

In [None]:
df_selected_temperature_2016 = df_temperature_records_months[df_temperature_records_months.index == 2016]
df_selected_temperature_2016 = df_selected_temperature_2016.T

df_selected_temperature_2016

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_selected_temperature_2016, legend=False)
plt.ylabel("Temperature (degC)")
plt.title(u"São Paulo temperature records - 2016")
plt.show()

Not so different at all. Maybe we can check a range of years in a single plot.

In [None]:
list_of_df_from_2008_to_2018 = list()

for year in range(2008, 2019):
    df_selected_temperature = df_temperature_records_months[df_temperature_records_months.index == year]
    df_selected_temperature = df_selected_temperature.T
    list_of_df_from_2008_to_2018.append(df_selected_temperature)
    
df_from_2008_to_2018 = pd.concat(list_of_df_from_2008_to_2018, axis=1)

df_from_2008_to_2018
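The same months-by-years table could probably be built without a loop, because label-based slicing with `.loc` includes both endpoints. The sketch below uses randomly generated values as a stand-in for `df_temperature_records_months`; `df_months_demo` and `df_range_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd

month_names = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
years = list(range(2005, 2020))

# Synthetic stand-in for the monthly DataFrame (values are random, not real data)
rng = np.random.default_rng(0)
df_months_demo = pd.DataFrame(
    rng.normal(20.0, 3.0, size=(len(years), 12)),
    index=pd.Index(years, name="YEAR"),
    columns=month_names,
)

# .loc slices by label and keeps both 2008 and 2018; .T puts months on the rows
df_range_demo = df_months_demo.loc[2008:2018].T
print(df_range_demo.shape)
```

On the real data, the equivalent expression would be `df_temperature_records_months.loc[2008:2018].T`, assuming the `YEAR` index is sorted.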

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_from_2008_to_2018, dashes=False)
plt.ylabel("Temperature (degC)")
plt.title(u"São Paulo temperature records from 2008 to 2018")
plt.show()

So, as one would expect, a pattern emerges that follows the seasons.

### Temperature continuous time-series through years

Now, what if we concatenate the monthly records of all years into one "continuous" series? It may give a more detailed view than the mean temperature for each year.

In [None]:
months = list()
temperature_record = list()

first_temperature_record = df_temperature_records_mean.index[0]
last_temperature_record = df_temperature_records_mean.index[-1]
# range() excludes its stop value, so add 1 to include the last year
for year in range(first_temperature_record, last_temperature_record + 1):
    df_selected_temperature = df_temperature_records_months[df_temperature_records_months.index == year]
    for month in df_selected_temperature:
        # This inner loop can be improved, for sure
        current_date = f"{month}-{year}"
        temperature_record.append(df_selected_temperature.loc[:, month].values[0])
        months.append(current_date)
        
list_temperature_per_months = list(zip(months, temperature_record))
df_temperature_per_months = pd.DataFrame(list_temperature_per_months, columns=["Time", "Temperature"])
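The nested loop above could also be replaced by reshaping the table from wide to long with `melt` and building real datetime values for the time axis. This is a sketch on synthetic data, not the author's method; `df_months_demo`, `df_long_demo` and `month_number` are hypothetical names, and the month column labels are assumed to be `JAN` to `DEC`.

```python
import numpy as np
import pandas as pd

month_names = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
month_number = {name: number for number, name in enumerate(month_names, start=1)}

# Synthetic stand-in for the monthly DataFrame (values are random, not real data)
rng = np.random.default_rng(0)
df_months_demo = pd.DataFrame(
    rng.normal(20.0, 3.0, size=(3, 12)),
    index=pd.Index([2000, 2001, 2002], name="YEAR"),
    columns=month_names,
)

# Wide (one column per month) -> long (one row per year/month pair)
df_long_demo = df_months_demo.reset_index().melt(
    id_vars="YEAR", var_name="MONTH", value_name="Temperature"
)

# Build a datetime for the first day of each month
df_long_demo["Time"] = pd.to_datetime(
    pd.DataFrame(
        {
            "year": df_long_demo["YEAR"],
            "month": df_long_demo["MONTH"].map(month_number),
            "day": 1,
        }
    )
)

# melt lists the rows month by month, so sort them chronologically
df_long_demo = df_long_demo.sort_values("Time").reset_index(drop=True)
print(df_long_demo.head())
```

With a datetime `Time` column, matplotlib can choose readable year ticks on its own, so the x tick labels would not need to be hidden.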

In [None]:
plt.figure(figsize=(26, 6))
sns.lineplot(x="Time", y="Temperature", data=df_temperature_per_months, sort=False)
plt.ylabel("Temperature (degC)")
plt.title(u"São Paulo temperature records from 1946 to 2019")
plt.xticks([])  # hide the x tick labels: one label per month would be unreadable
plt.show()

Looking at the plot above, we can see that seasonal effects are present and the pattern is clear. Moreover, it looks like the Sao Paulo mean temperature increased a little over the given time range.
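To put a rough number on "increased a little", one option is to fit a straight line to the annual means by least squares. The sketch below runs on synthetic data with a built-in trend of 0.02 degC per year, so its result says nothing about the real records; `annual_mean_demo` and `trend_per_decade` are hypothetical names.

```python
import numpy as np
import pandas as pd

years = np.arange(1946, 2020)

# Synthetic annual means with a known linear trend (not real station data)
annual_mean_demo = pd.Series(
    18.0 + 0.02 * (years - years[0]), index=years, name="metANN"
)
annual_mean_demo.loc[[1950, 1987]] = np.nan  # mimic gaps in the record

# Drop missing years before fitting, since polyfit does not skip NaN
valid = annual_mean_demo.dropna()

# Fit temperature = slope * (years since first year) + intercept
elapsed_years = valid.index.to_numpy() - valid.index[0]
slope, intercept = np.polyfit(elapsed_years, valid.to_numpy(), deg=1)

trend_per_decade = slope * 10
print(f"Trend: {trend_per_decade:.2f} degC per decade")
```

On the real data, the input would be `df_temperature_records_mean["metANN"]`. A straight line is only a first approximation and does not show whether the trend is statistically significant.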