## 1. Introduction
<p>This <i>dataset</i> contains weather data for Raleigh-Durham International Airport, retrieved from NOAA's web service.</p>

## 2. Reading the Data

In [2]:
import pandas as pd

# The columns are separated by the ';' character.
weather_history = pd.read_csv('rdu-weather-history.csv', sep=';')

weather_history.head()

Unnamed: 0,date,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,...,drizzle,snow,freezingrain,smokehaze,thunder,highwind,hail,blowingsnow,dust,freezingfog
0,2009-10-03,55.0,82.0,0.0,0.0,0.0,2.91,240.0,16.11,230.0,...,No,No,No,No,No,No,No,No,No,No
1,2009-10-10,59.0,79.0,0.02,0.0,0.0,7.83,220.0,17.0,220.0,...,No,No,No,No,No,No,Yes,No,No,No
2,2009-10-14,46.9,61.0,0.14,0.0,0.0,8.72,40.0,14.99,50.0,...,Yes,No,No,No,No,No,Yes,No,No,No
3,2009-10-17,45.0,57.9,0.0,0.0,0.0,6.26,30.0,14.09,40.0,...,No,No,No,No,No,No,No,No,No,No
4,2009-10-29,48.0,68.0,0.0,0.0,0.0,5.82,80.0,14.99,70.0,...,No,No,No,No,No,No,No,No,No,No
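
<p>As a minimal sketch of what <code>sep=';'</code> does, the snippet below reads a tiny in-memory sample instead of the real file. The two rows are copied from the output above and reduced to three columns; the semicolon-separated layout of the sample and the use of <code>parse_dates</code> are assumptions made for illustration, not something the original notebook does.</p>

```python
import io

import pandas as pd

# Two rows copied from the head() output above, reduced to three columns.
# The exact layout of the real file is an assumption.
sample_csv = io.StringIO(
    "date;temperaturemin;temperaturemax\n"
    "2009-10-03;55.0;82.0\n"
    "2009-10-10;59.0;79.0\n"
)

# sep=';' splits each line on semicolons; parse_dates converts 'date' to datetime64.
sample = pd.read_csv(sample_csv, sep=';', parse_dates=['date'])

print(sample.dtypes)
print(sample['date'].dt.year.tolist())
```

<p>Parsing the date column up front makes later filtering by year or month easier, should that be needed.</p>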


## 3. Overview
<p>The <i>dataset</i> contains weather records for Raleigh-Durham International Airport since 2007. It includes minimum and maximum temperatures, precipitation, wind speed, and other measurements.</p>

In [3]:
# Number of entries
num_history = weather_history.shape[0]
print("Number of rows in the dataset:", num_history)

print("\n\nStatistical summary of the DataFrame:")
weather_history.describe()

Number of rows in the dataset: 4137


Statistical summary of the DataFrame:


Unnamed: 0,temperaturemin,temperaturemax,precipitation,snowfall,snowdepth,avgwindspeed,fastest2minwinddir,fastest2minwindspeed,fastest5secwinddir,fastest5secwindspeed
count,4136.0,4136.0,4136.0,4135.0,4136.0,4134.0,4135.0,4135.0,4118.0,4118.0
mean,50.540063,72.017021,0.12663,0.012965,0.017384,5.860614,172.541717,15.957151,177.056824,21.80161
std,16.229527,16.530515,0.371318,0.195214,0.213953,2.958446,94.603272,5.270319,96.850988,7.096004
min,4.1,23.2,0.0,0.0,0.0,0.0,10.0,4.92,10.0,6.93
25%,37.0,60.1,0.0,0.0,0.0,3.58,80.0,12.97,90.0,17.0
50%,52.0,73.9,0.0,0.0,0.0,5.37,210.0,14.99,210.0,21.03
75%,64.9,86.0,0.04,0.0,0.0,7.61,240.0,18.12,240.0,25.05
max,80.1,105.1,6.45,6.69,5.91,19.01,360.0,59.95,360.0,86.12
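
<p>The <code>count</code> row above is lower than the 4137 rows of the dataset for every column, which suggests that some values are missing, because <code>describe()</code> counts only non-missing entries. A minimal sketch of how this could be checked is shown below on a hypothetical miniature DataFrame; the positions of the missing values are invented for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the NaN positions are invented for illustration.
sample = pd.DataFrame({
    "temperaturemin": [55.0, 59.0, np.nan, 45.0],
    "avgwindspeed": [2.91, np.nan, np.nan, 6.26],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# describe() reports only the non-missing entries in its 'count' row.
non_missing = sample.describe().loc["count"]

print(missing_per_column)
print(non_missing)
```

<p>On the real data, the same call would be <code>weather_history.isna().sum()</code>.</p>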


## 3.1 Using Seaborn's heatmap function
<p>Seaborn is a Python library for statistical data visualization.</p>
<p>The <code>.corr()</code> method computes the <b>Pearson correlation coefficient</b> between every pair of numeric columns in the DataFrame.</p>
<p>Interpreting the coefficient:</p>

- 0.9 or more, positive or negative, indicates a very strong correlation.
- 0.7 to 0.9, positive or negative, indicates a strong correlation.
- 0.5 to 0.7, positive or negative, indicates a moderate correlation.
- 0.3 to 0.5, positive or negative, indicates a weak correlation.
- 0 to 0.3, positive or negative, indicates a negligible correlation.
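
<p>The scale above can be sketched as a small helper. The function names <code>pearson</code> and <code>correlation_strength</code> are hypothetical, and assigning each boundary value (0.3, 0.5, 0.7, 0.9) to the stronger category is an assumption, since the list leaves the boundaries ambiguous. The sample values are the five minimum and maximum temperatures from the rows shown in section 2.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_strength(r):
    """Label a coefficient by its magnitude; boundary handling is an assumption."""
    magnitude = abs(r)
    if magnitude >= 0.9:
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    return "negligible"


# Minimum and maximum temperatures from the five rows shown in section 2.
temperature_min = [55.0, 59.0, 46.9, 45.0, 48.0]
temperature_max = [82.0, 79.0, 61.0, 57.9, 68.0]

r = pearson(temperature_min, temperature_max)
print(round(r, 3), correlation_strength(r))
```

<p>Five rows are far too few to draw conclusions; the heatmap computes the same coefficient over the whole dataset.</p>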

In [4]:
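import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

plt.figure(figsize=(18, 10))
# Keep only numeric columns: recent pandas versions raise an error when corr() meets text columns.
numeric_columns = weather_history.select_dtypes(include='number')
weather_map = sns.heatmap(numeric_columns.corr(), annot=True, square=True, cmap="YlGnBu", linewidths=.3)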


## 4. Logistic Regression
<p>Logistic regression is a <b>classification</b> algorithm. It predicts a binary outcome (1/0, Yes/No, True/False) from a set of independent variables.</p>
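
<p>As a rough sketch of the idea, logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function to obtain a probability, and then applies a threshold to choose the class. The weights, bias, and the helper names <code>sigmoid</code> and <code>predict</code> below are hypothetical; a fitted model learns its weights from the data.</p>

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return 1 when the estimated probability reaches the threshold, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical weights for two binary indicators (for example drizzle and thunder).
weights = [2.0, 1.5]
bias = -1.0

print(predict([1, 0], weights, bias))  # weighted sum z = 1.0
print(predict([0, 0], weights, bias))  # weighted sum z = -1.0
```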

In [10]:
import sklearn.linear_model

# Encode the 'Yes'/'No' indicator columns as 1.0/0.0.
clean_weather_history = weather_history.replace('No', 0.0)
clean_weather_history = clean_weather_history.replace('Yes', 1.0)

predictors = ['drizzle', 'thunder', 'snow', 'fog', 'mist', 'hail']

x_train = clean_weather_history[predictors].values
y_train = clean_weather_history['rain'].values

model = sklearn.linear_model.LogisticRegression()
model.fit(x_train, y_train)

# Note: the model predicts on the same rows it was trained on,
# so the output below does not measure performance on unseen data.
x_test = clean_weather_history[predictors].values
predicted = model.predict(x_test)

# Overwrite the encoded 'rain' column with the predictions;
# weather_history still holds the original labels.
clean_weather_history['rain'] = predicted

print("Results Predicted:\n", clean_weather_history['rain'])
print("\nOriginal dataset, 'rain' column:\n", weather_history['rain'])

Results Predicted:
 0       0.0
1       1.0
2       1.0
3       0.0
4       0.0
5       0.0
6       1.0
7       0.0
8       0.0
9       1.0
10      1.0
11      1.0
12      0.0
13      0.0
14      1.0
15      0.0
16      0.0
17      0.0
18      0.0
19      0.0
20      0.0
21      1.0
22      0.0
23      0.0
24      0.0
25      0.0
26      0.0
27      1.0
28      1.0
29      0.0
       ... 
4107    0.0
4108    0.0
4109    0.0
4110    0.0
4111    0.0
4112    0.0
4113    0.0
4114    0.0
4115    0.0
4116    0.0
4117    0.0
4118    0.0
4119    0.0
4120    0.0
4121    0.0
4122    0.0
4123    0.0
4124    0.0
4125    0.0
4126    0.0
4127    0.0
4128    0.0
4129    0.0
4130    0.0
4131    0.0
4132    0.0
4133    0.0
4134    0.0
4135    0.0
4136    0.0
Name: rain, Length: 4137, dtype: float64

Original dataset, 'rain' column:
 0        No
1       Yes
2       Yes
3        No
4        No
5        No
6       Yes
7        No
8        No
9       Yes
10      Yes
11      Yes
12      Yes
13       No
14  
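
<p>The two printed series are hard to compare by eye. A minimal sketch of a direct comparison is shown below, using only the first 14 rows that are visible in both outputs above, transcribed by hand. On the full data, <code>sklearn.metrics.accuracy_score</code> could do the same job, and <code>sklearn.model_selection.train_test_split</code> could hold out rows so that the model is not scored on its own training data.</p>

```python
# First 14 rows transcribed from the two printed outputs above.
predicted = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0]
original = ["No", "Yes", "Yes", "No", "No", "No", "Yes",
            "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]

# Encode the labels the same way the notebook does ('Yes' -> 1, 'No' -> 0).
actual = [1 if label == "Yes" else 0 for label in original]

matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
mismatched_rows = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]

print(f"{matches} of {len(actual)} rows match (accuracy {accuracy:.3f})")
print("Mismatched row indices:", mismatched_rows)
```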

## 5. References

> [Pearson correlation coefficient (Portuguese Wikipedia)](https://pt.wikipedia.org/wiki/Coeficiente_de_correla%C3%A7%C3%A3o_de_Pearson#Refer%C3%AAncias)

> [Why isn't Logistic Regression called Logistic Classification?](https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification)



