This is a collection of thematically related datasets that are suitable for different types of regression analysis. Each set of datasets requires a different technique. For each dataset, a suggested question that can be answered with regression has been posed.

## Linear regression (predicting a continuous value):

* [CalCOFI: Over 60 years of oceanographic data](https://www.kaggle.com/sohier/calcofi): Is there a relationship between water salinity & water temperature? Can you predict the water temperature based on salinity?
* [Weather in Szeged 2006-2016](https://www.kaggle.com/budincsevity/szeged-weather): Is there a relationship between humidity and temperature? What about between humidity and apparent temperature? Can you predict the apparent temperature given the humidity?
* [Weather Conditions in World War Two](https://www.kaggle.com/smid80/weatherww2/data): Is there a relationship between the daily minimum and maximum temperature? Can you predict the maximum temperature given the minimum temperature? 
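
As a rough illustration of the kind of question posed above, the following sketch fits a straight line to two continuous variables with ordinary least squares. The data is synthetic (generated from a made-up slope and intercept, not taken from CalCOFI or the weather datasets), and the variable names are only placeholders for columns such as salinity and temperature.

```python
import numpy as np

# Synthetic stand-in for two continuous columns (NOT real CalCOFI values).
rng = np.random.default_rng(0)
salinity = rng.uniform(33.0, 35.0, size=200)
temperature = 150.0 - 4.0 * salinity + rng.normal(0.0, 0.5, size=200)

# Fit temperature = slope * salinity + intercept by ordinary least squares.
slope, intercept = np.polyfit(salinity, temperature, deg=1)

# Predict from the fitted line and measure fit quality with R^2.
predicted = slope * salinity + intercept
ss_res = np.sum((temperature - predicted) ** 2)
ss_tot = np.sum((temperature - temperature.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

With a real dataset, the two arrays would be replaced by the relevant columns after missing values have been handled.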

## Poisson regression (predicting a count value):

* [Montreal bike lanes: Use of bike lanes in Montreal city in 2015](https://www.kaggle.com/pablomonleon/montreal-bike-lanes): Is there a relationship between the number of bicyclists who use different bike paths on the same day? Can you predict how many riders there will be on one path given how many are on another?
* [New York City - East River Bicycle Crossings: Daily bicycle counts for major bridges in NYC](https://www.kaggle.com/new-york-city/nyc-east-river-bicycle-crossings): Is there a relationship between the number of bicyclists who cross different bridges in New York?
* (Requires some cleaning) [UK 2016 Road Safety Data: Data from the UK Department for Transport](https://www.kaggle.com/bluehorseshoe/uk-2016-road-safety-data/data): Is there a relationship between the number of people in the car and the number of casualties in road accidents?
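
To make the idea of predicting a count concrete, here is a minimal sketch of Poisson regression with a log link, fitted by Newton-Raphson in plain NumPy. The counts are simulated from hypothetical coefficients (`true_beta` is an assumption for illustration only), so this shows the technique rather than results for any of the datasets above. In practice a library such as statsmodels would normally be used instead of a hand-written fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: one predictor and a count outcome (not real bike data).
x = rng.uniform(0.0, 2.0, size=500)
X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
true_beta = np.array([1.0, 0.8])            # hypothetical log-scale intercept and slope
y = rng.poisson(np.exp(X @ true_beta))

# Start from a least-squares fit on log(y + 0.5) so Newton steps begin near the optimum.
beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]

# Newton-Raphson on the Poisson log-likelihood.
for _ in range(25):
    mu = np.exp(X @ beta)                   # expected count under the current coefficients
    gradient = X.T @ (y - mu)
    hessian = X.T @ (X * mu[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

print("estimated coefficients:", beta)
```

The fitted coefficients are on the log scale: a slope of `b` means the expected count is multiplied by `exp(b)` for each unit increase in the predictor.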

## Logistic regression (predicting a categorical value, often with two categories):

* [The Ultimate Halloween Candy Power Ranking](https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking/): Can you predict if a candy is chocolate or not based on its other features?
* [Epicurious - Recipes with Rating and Nutrition](https://www.kaggle.com/hugodarwood/epirecipes): Can you predict whether a recipe was part of #cakeweek based on its other features?
* (Requires some cleaning) [Competition context and results from 1,559 Kansas City Barbecue Society Barbeque Competitions](https://www.kaggle.com/jaysobel/kcbs-bbq/): Can you model whether a team will win first place based on their score and the competition they’re at?
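
For the two-category case, the sketch below fits a logistic regression by Newton-Raphson in plain NumPy. The labels are simulated from hypothetical coefficients (`true_beta` is an assumption, and the single feature is not taken from the candy, recipe, or barbecue data), so it only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: one numeric feature and a 0/1 label (not real candy data).
x = rng.normal(0.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
true_beta = np.array([-0.5, 1.5])           # hypothetical intercept and slope
probabilities = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.uniform(size=500) < probabilities).astype(float)

# Newton-Raphson on the logistic log-likelihood.
beta = np.zeros(2)
for _ in range(15):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probability of class 1
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

# Classify with a 0.5 probability threshold and report training accuracy.
predicted_class = (1.0 / (1.0 + np.exp(-(X @ beta))) >= 0.5).astype(float)
accuracy = float(np.mean(predicted_class == y))
print("estimated coefficients:", beta, "accuracy:", accuracy)
```

Training accuracy is reported here only as a sanity check; a held-out split would be needed to judge predictive quality on a real dataset.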

In [3]:
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# read in all our data
df_bottle = pd.read_csv("../input/bottle.csv")
df_cast = pd.read_csv("../input/cast.csv")

# set seed for reproducibility
np.random.seed(0)

In [4]:
df_bottle.info()

In [19]:
df_bottle['Salnty']

In [8]:
print(df_bottle.shape[0])
missing_values_count = df_bottle.isnull().sum()
missing_values_count
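
Raw counts of missing values are easier to judge as a share of all rows. The following sketch shows one way to do that on a tiny made-up frame; the column names mirror the ones used in this notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny invented frame that reuses this notebook's column names.
toy = pd.DataFrame({
    "Depthm": [0, 10, 20, 30],
    "T_degC": [10.5, np.nan, 10.1, 9.8],
    "Salnty": [33.4, 33.5, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing entries per column.
percent_missing = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(percent_missing)
```

The same two lines can be applied to `df_bottle` to see which columns are mostly empty before deciding how to fill them.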

Scale Depth 

In [9]:
# select the depth in meters, a linear parameter that increases as we go down
depth_m = df_bottle.Depthm

# scale the Depth from 0 to 1
scaled_data = minmax_scaling(depth_m, columns = [0])

# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(df_bottle.Depthm, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
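
Min-max scaling maps each value to `(x - min) / (max - min)`, so the range becomes 0 to 1 while the shape of the distribution stays the same. The sketch below applies that formula by hand to a few invented depth values to show what the scaling step computes.

```python
import numpy as np

# Invented depth values in meters, for illustration only.
depth = np.array([0.0, 10.0, 50.0, 200.0, 1000.0])

# Min-max scaling: subtract the minimum, divide by the range.
scaled = (depth - depth.min()) / (depth.max() - depth.min())
print(scaled)

# The ordering of the values is unchanged; only the units are.
assert np.array_equal(np.argsort(depth), np.argsort(scaled))
```

Because the transformation is linear, the two histograms plotted side by side should look identical apart from the x-axis labels.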



Normalize depth to compare with scaling

In [10]:
# get the index of all positive depth (Box-Cox only takes positive values)
index_of_positive_depth  = df_bottle.Depthm > 0

# get only positive depth (using their indexes)
positive_depth = df_bottle.Depthm.loc[index_of_positive_depth]

# normalize the depth (w/ Box-Cox)
normalized_depth = stats.boxcox(positive_depth)[0]

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_depth, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_depth, ax=ax[1])
ax[1].set_title("Normalized data")
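
When no lambda is passed, `stats.boxcox` returns two values: the transformed data and the lambda it selected by maximum likelihood. The cell above keeps only the first. The sketch below, on synthetic right-skewed data (a log-normal sample, not the CalCOFI depths), keeps both and checks the result against the Box-Cox formula `(x**lambda - 1) / lambda`, which becomes `log(x)` when lambda is 0.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample (not real depth data).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

# Keep both return values: the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

# Reproduce the transform by hand from the fitted lambda.
if abs(fitted_lambda) > 1e-12:
    manual = (skewed ** fitted_lambda - 1.0) / fitted_lambda
else:
    manual = np.log(skewed)

print("fitted lambda:", fitted_lambda)
print("skewness before:", stats.skew(skewed), "after:", stats.skew(transformed))
```

Inspecting the fitted lambda is a quick way to see how strongly the data had to be bent to look normal.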

Scale T_degC

In [22]:
# replace every NaN value in the dataset with 0
df_bottle_na_filled = df_bottle.fillna(0)

In [13]:
# select T_degC, a linear parameter

TdegC = df_bottle_na_filled.T_degC

# scale the T_degC from 0 to 1
scaled_data = minmax_scaling(TdegC, columns = [0])

# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(df_bottle_na_filled.T_degC, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

Normalize T_degC

In [14]:
# get the index of all positive T_degC (Box-Cox only takes positive values)
index_of_positive_T_degC  = df_bottle_na_filled.T_degC > 0

# get only positive T_degC (using their indexes)
positive_T_degC = df_bottle_na_filled.T_degC.loc[index_of_positive_T_degC]

# normalize the T_degC (w/ Box-Cox)
normalized_T_degC = stats.boxcox(positive_T_degC)[0]

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_T_degC, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_T_degC, ax=ax[1])
ax[1].set_title("Normalized data")

Scale Salnty

In [17]:
# select Salnty, a linear parameter

Salnty_m = df_bottle_na_filled.Salnty

# scale Salnty from 0 to 1
scaled_data = minmax_scaling(Salnty_m, columns = [0])

# plot the original & scaled data together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(df_bottle_na_filled.Salnty, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

Normalize Salnty

In [23]:
# get the index of all positive Salnty (Box-Cox only takes positive values)
index_of_positive_Salnty  = df_bottle_na_filled.Salnty > 0

# get only positive Salnty (using their indexes)
positive_Salnty = df_bottle_na_filled.Salnty.loc[index_of_positive_Salnty]

# normalize the Salnty (w/ Box-Cox)
normalized_Salnty = stats.boxcox(positive_Salnty)[0]

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.distplot(positive_Salnty, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_Salnty, ax=ax[1])
ax[1].set_title("Normalized data")

Three fields were analyzed: Depthm, T_degC, and Salnty. The inference is that Depthm needs to be scaled, whereas T_degC and Salnty need to be normalized. All NaN values in T_degC and Salnty were filled with 0 before the analysis.
Note: the x-axis range is very large after normalizing Salnty. Why this happens is still to be analyzed.
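
One possible line of investigation for the note above, offered as an untested hypothesis rather than a finding: the size of Box-Cox output depends strongly on the fitted lambda, which `stats.boxcox` returns as its second value. The sketch below uses two invented values in a narrow band (33 and 34, chosen only for illustration) and a hypothetical helper, `boxcox_with_lambda`, to show how the spread of the transformed values grows as lambda increases.

```python
import numpy as np

def boxcox_with_lambda(x, lam):
    """Box-Cox transform for a fixed, non-zero lambda (hypothetical helper)."""
    return (np.power(x, lam) - 1.0) / lam

# Two invented values one unit apart, for illustration only.
values = np.array([33.0, 34.0])

# Spread (max - min) of the transformed values for several lambdas.
spans = {}
for lam in (1.0, 2.0, 5.0):
    out = boxcox_with_lambda(values, lam)
    spans[lam] = float(out.max() - out.min())
    print(f"lambda={lam}: transformed span = {spans[lam]:.1f}")
```

If the lambda fitted for Salnty turns out to be large, that alone could account for a wide x-axis. Printing `stats.boxcox(positive_Salnty)[1]` would confirm or rule this out.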