# Initial Exploration of Data

## Data Set

In [32]:
import pandas as pd
wine_data = pd.read_csv("wineData.csv")

In [8]:
wine_data.head()

,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


### Notes: 
The data set contains information on two types of wine, white and red. Each row records the chemical properties of one wine together with a quality rating for that wine.

In [7]:
wine_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
type                    6497 non-null object
fixed acidity           6487 non-null float64
volatile acidity        6489 non-null float64
citric acid             6494 non-null float64
residual sugar          6495 non-null float64
chlorides               6495 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6488 non-null float64
sulphates               6493 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB


### Notes: 
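There are missing values in seven of the fields (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, pH and sulphates), between 2 and 10 per field out of 6497 rows. All of the data is numerical except for the type column.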
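
The missing counts can be read off `info()` by subtracting the non-null count from the row total, but it may be easier to ask pandas directly. The sketch below uses a small made-up frame (`toy`, with hypothetical values that are not from `wineData.csv`) to show one way to list only the columns that have gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wine_data (hypothetical values, not from wineData.csv).
toy = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "pH": [3.0, np.nan, 3.26, 3.19],
    "sulphates": [0.45, 0.49, np.nan, np.nan],
    "quality": [6, 5, 6, 7],
})

# Count missing entries per column, then keep only columns with at least one.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Running the same two lines on `wine_data` should list the columns that need cleaning.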

In [19]:
wine_data["type"].value_counts()

white    4898
red      1599
Name: type, dtype: int64

### Notes: 
There is a lot more white wine data than red wine data (4898 rows versus 1599). When fitting the linear regression, I might have to split the white and red wine data into separate sets to get a better model, since their chemical properties might differ and therefore yield different quality ratings.
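
A minimal sketch of how that split might look, assuming the `type` column holds only the labels `white` and `red`. The frame `toy` and its values are made up for illustration, and the names `white_wine` and `red_wine` are hypothetical.

```python
import pandas as pd

# Toy stand-in for wine_data (hypothetical values, not from wineData.csv).
toy = pd.DataFrame({
    "type": ["white", "white", "red", "white", "red"],
    "alcohol": [8.8, 9.5, 9.4, 10.1, 9.8],
    "quality": [6, 6, 5, 6, 5],
})

# Build one DataFrame per wine type; the type column is dropped because it
# is constant within each group.
by_type = {name: group.drop(columns="type") for name, group in toy.groupby("type")}
white_wine = by_type["white"]
red_wine = by_type["red"]
print(len(white_wine), len(red_wine))
```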

## Predicting
I would like to be able to predict the quality of a wine using its chemical properties. That way I could see which properties make the most difference to the quality rating. Eventually, I would also like to try classifying the type of wine using the chemical properties.
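
One possible way to see which properties matter most is to fit a linear regression on standardized features and compare the sizes of the coefficients. This is only a sketch of the idea, not a result: the data below is synthetic and deliberately built so that `alcohol` dominates, and the choice of `StandardScaler` plus `LinearRegression` is my assumption about how the model could be set up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the wine data). Quality is constructed to
# depend strongly on "alcohol" and weakly on "pH", so the ranking is known.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 200),
    "pH": rng.normal(3.2, 0.15, 200),
})
toy["quality"] = 0.8 * toy["alcohol"] + 0.5 * toy["pH"] + rng.normal(0, 0.1, 200)

features = ["alcohol", "pH"]

# Standardize so each coefficient is the change in quality per one standard
# deviation of its feature, which makes the coefficients comparable.
X = StandardScaler().fit_transform(toy[features])
model = LinearRegression().fit(X, toy["quality"])

# Rank features by absolute coefficient size, largest first.
coefficients = pd.Series(model.coef_, index=features)
ranking = coefficients.abs().sort_values(ascending=False)
print(ranking)
```

On the real data the coefficients would only be a rough guide, since correlated properties can share or trade off their weights.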

## Clean Data

In [24]:
wine_data2 = wine_data.copy()
fixed_acidity_mean = wine_data2["fixed acidity"].mean()
volatile_acidity_mean = wine_data2["volatile acidity"].mean()
citric_acid_mean = wine_data2["citric acid"].mean()
residual_sugar_mean = wine_data2["residual sugar"].mean()
chlorides_mean = wine_data2["chlorides"].mean()
pH_mean = wine_data2["pH"].mean()
sulphates_mean = wine_data2["sulphates"].mean()

# Assign each filled column back to the frame. Calling fillna(inplace=True)
# on a selected column may not update wine_data2 in newer pandas versions.
wine_data2["fixed acidity"] = wine_data2["fixed acidity"].fillna(fixed_acidity_mean)
wine_data2["volatile acidity"] = wine_data2["volatile acidity"].fillna(volatile_acidity_mean)
wine_data2["citric acid"] = wine_data2["citric acid"].fillna(citric_acid_mean)
wine_data2["residual sugar"] = wine_data2["residual sugar"].fillna(residual_sugar_mean)
wine_data2["chlorides"] = wine_data2["chlorides"].fillna(chlorides_mean)
wine_data2["pH"] = wine_data2["pH"].fillna(pH_mean)
wine_data2["sulphates"] = wine_data2["sulphates"].fillna(sulphates_mean)

wine_data2.head(20)

,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
5,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
6,white,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6
7,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
8,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
9,white,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6


In [28]:
wine_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
type                    6497 non-null object
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB


### Notes: 
All of the missing values have been filled with the mean of their column.
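
The column-by-column fill can also be written more compactly, since `fillna` accepts a Series of per-column values. The sketch below shows this on a small made-up frame (`toy`, hypothetical values); `numeric_only=True` is there so the text column `type` is skipped when computing the means.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wine_data (hypothetical values, not from wineData.csv).
toy = pd.DataFrame({
    "type": ["white", "red", "white"],
    "pH": [3.0, np.nan, 3.4],
    "sulphates": [0.4, 0.6, np.nan],
})

# Mean of every numeric column, ignoring missing entries.
column_means = toy.mean(numeric_only=True)

# Each missing entry is replaced by the mean of its own column.
filled = toy.fillna(column_means)
print(filled)
```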

In [35]:
import matplotlib.pyplot as plt

wine_data2.hist(bins=10)
plt.show()

## Creating Training/Test Sets

In [26]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(wine_data2, test_size=0.2, random_state=123)
print(len(train_set), len(test_set))
print(train_set.head())
print(test_set.head())

5197 1300
       type  fixed acidity   ...     alcohol  quality
6452    red            6.6   ...        11.0        6
5110    red           11.6   ...        10.2        6
2792  white            6.8   ...         9.4        5
1879  white            7.2   ...         9.2        6
2742  white            8.0   ...         9.5        6

[5 rows x 13 columns]
       type  fixed acidity   ...     alcohol  quality
1321  white            7.3   ...        13.2        6
2767  white            7.9   ...         9.5        6
5069    red            8.0   ...         9.2        6
5780    red            8.4   ...        12.0        6
547   white            7.7   ...        11.8        6

[5 rows x 13 columns]
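
### Notes: 
Because there is much more white wine than red, a plain random split could leave the two sets with slightly different white/red proportions. One option, sketched below on a made-up frame (`toy`, hypothetical values), is to pass the `stratify` argument of `train_test_split` so both sets keep the same mix. The last lines reproduce the split sizes printed above: with `test_size=0.2`, scikit-learn rounds the test share of 6497 rows up.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 3:1 white/red imbalance (hypothetical values).
toy = pd.DataFrame({
    "type": ["white"] * 15 + ["red"] * 5,
    "quality": [6, 5, 7, 6] * 5,
})

# stratify keeps the white/red ratio the same in both sets.
train_set, test_set = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["type"]
)
type_counts = test_set["type"].value_counts()
print(type_counts)

# Size check for the real data: the test share is rounded up and the
# training set gets the remaining rows.
n_rows = 6497
n_test = math.ceil(n_rows * 0.2)
n_train = n_rows - n_test
print(n_train, n_test)
```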
