**WELCOME TO MY NOTEBOOK**

This notebook is an introduction to using a Random Forest to predict the quality of wine and to choosing appropriate parameters when fitting the model.

**IMPORTING REQUIRED PACKAGES**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

import scipy.stats as stats
from scipy.stats import pearsonr  # public import path; scipy.stats.stats is a deprecated private module

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns

**IMPORTING MY DATA**

In [None]:
wine=pd.read_csv("../input/winequality-red.csv")
wine.head()

**LOOKING INTO DATA**

In [None]:
wine.shape

In [None]:
wine.describe()  # equal counts across all columns: no missing values
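
If we want to check for missing values directly instead of reading the counts from `describe()`, a minimal sketch could look like the one below. The `toy` DataFrame is a made-up stand-in (one value is deliberately missing); on the real data we would call the same method on `wine`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine table (hypothetical values, not the real data)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

# Number of missing values per column; all zeros would mean nothing is missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```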

In [None]:
wine.columns=['fixed_acidity','volatile_acidity','citric_acid','residual_sugar','chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density','pH','sulphates','alcohol','quality']


**FEATURE FINDING**

We are going to use an ANOVA test to decide whether to use a particular column as a feature or not.

ANOVA is a parametric test, so the distribution must be approximately normal for it to give a reliable result.
Here we plot all the tentative features to see their distributions.

In [None]:
def hist_plotter(wine,column):
    # Histogram of one column: x-axis is the column's value, y-axis is the count
    wine[column].plot.hist(figsize=(10,5))
    plt.xlabel(column,fontsize=10)
    plt.ylabel("Frequency",fontsize=10)
    plt.title("Distribution of "+column,fontsize=10)
    plt.show()

We write this function so that we do not have to repeat the same plt calls again and again.

In [None]:
def skew(wine,col):
    # Log-transform the column in place to reduce skew (values must be > 0)
    wine[col] = np.log(wine[col])

This is a skewness correction function. We will see its use very soon.

We can see that the graph plotted below is not exactly bell-shaped: it is pushed towards the left.

This can lead to an incorrect result in ANOVA.

In [None]:

hist_plotter(wine,'volatile_acidity')


In [None]:
skew(wine,'volatile_acidity')
hist_plotter(wine,'volatile_acidity')


We try to correct the skewness by taking the log transform of a column whenever it is necessary.
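
Instead of judging the histograms only by eye, we could also put a number on the skewness before and after the log transform. The sketch below uses pandas' `Series.skew()` on a made-up right-skewed column (log-normal draws standing in for a column such as `volatile_acidity`); the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy right-skewed column (hypothetical data, not the wine data)
toy = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=2000))

skew_before = toy.skew()          # clearly positive for right-skewed data
skew_after = np.log(toy).skew()   # close to 0 once the log transform is applied
print(skew_before, skew_after)
```

A value near 0 means the distribution is roughly symmetric, which is what we want before running ANOVA.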

In [None]:
hist_plotter(wine,'citric_acid')


In [None]:
skew(wine,'residual_sugar')
hist_plotter(wine,'residual_sugar')

In [None]:
skew(wine,'chlorides')
hist_plotter(wine,'chlorides')

In [None]:
skew(wine,'free_sulfur_dioxide')
hist_plotter(wine,'free_sulfur_dioxide')

In [None]:
skew(wine,'total_sulfur_dioxide')
hist_plotter(wine,'total_sulfur_dioxide')

In [None]:
hist_plotter(wine,'density')

In [None]:
hist_plotter(wine,'pH')

In [None]:
skew(wine,'sulphates')
hist_plotter(wine,'sulphates')

In [None]:
skew(wine,'alcohol')
hist_plotter(wine,'alcohol')

**ANOVA TEST**

We run an ANOVA test on each column to determine whether it is a good feature or not.

A good feature is something that helps us distinguish between different labels (in this case, what makes a wine good).

In an ANOVA test we look at the F-statistic.

A high F value shows that the variation between different groups is larger than the variation within each group. The cells marked `#take` below are the features we keep.
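
To make the F-statistic less of a black box, here is a small sketch that computes it by hand as (between-group variance) / (within-group variance) and compares it with `scipy.stats.f_oneway`. The three groups are made-up numbers standing in for one feature measured at three quality levels.

```python
import numpy as np
from scipy.stats import f_oneway

# Toy measurements for three hypothetical quality groups
groups = [
    np.array([0.70, 0.88, 0.76, 0.80]),
    np.array([0.60, 0.66, 0.58, 0.62]),
    np.array([0.40, 0.36, 0.42, 0.38]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```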

In [None]:
model=ols('chlorides ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('fixed_acidity ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('free_sulfur_dioxide ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) 

In [None]:
model=ols('citric_acid ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('residual_sugar ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) 

In [None]:
model=ols('total_sulfur_dioxide ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('density ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('pH ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) 

In [None]:
model=ols('sulphates ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('alcohol ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

In [None]:
model=ols('volatile_acidity ~ quality',data=wine).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print (aov_table) #take

Like other parametric tests, the analysis of variance assumes that the data fit the normal distribution. If your measurement variable is not normally distributed, you may be increasing your chance of a false positive result if you analyze the data with an anova or other test that assumes normality. Fortunately, an anova is not very sensitive to moderate deviations from normality; simulation studies, using a variety of non-normal distributions, have shown that the false positive rate is not affected very much by this violation of the assumption (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996). This is because when you take a large number of random samples from a population, the means of those samples are approximately normally distributed even when the population is not normal.

It is possible to test the goodness-of-fit of a data set to the normal distribution. I do not suggest that you do this, because many data sets that are significantly non-normal would be perfectly appropriate for an anova.

Instead, if you have a large enough data set, I suggest you just look at the frequency histogram. If it looks more-or-less normal, go ahead and perform an anova. If it looks like a normal distribution that has been pushed to one side, like the sulphate data above, you should try different data transformations and see if any of them make the histogram look more normal. If that doesn't work, and the data still look severely non-normal, it's probably still okay to analyze the data using an anova. However, you may want to analyze it using a non-parametric test. Just about every parametric statistical test has a non-parametric substitute, such as the Kruskal–Wallis test instead of a one-way anova, Wilcoxon signed-rank test instead of a paired t-test, and Spearman rank correlation instead of linear regression. These non-parametric tests do not assume that the data fit the normal distribution. They do assume that the data in different groups have the same distribution as each other, however; if different groups have different shaped distributions (for example, one is skewed to the left, another is skewed to the right), a non-parametric test may not be any better than a parametric one.
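
As a sketch of the non-parametric route mentioned above, the Kruskal-Wallis test is available as `scipy.stats.kruskal`. The three samples below are made-up skewed (exponential) data standing in for one feature at three quality levels; they share the same shape and differ only by a shift.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(2)

# Toy skewed measurements for three hypothetical quality groups
low = rng.exponential(scale=1.0, size=100)
mid = rng.exponential(scale=1.0, size=100) + 1.0
high = rng.exponential(scale=1.0, size=100) + 2.0

# H statistic and p-value; a small p-value suggests the groups differ
h_stat, p_value = kruskal(low, mid, high)
print(h_stat, p_value)
```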

For now we will use them as they are.

Now we are done with selecting Features for our model.

**FINAL FEATURES**

Now we will drop all the undesirable features, which do not give much information about wine quality.

We place `quality`, the value we have to predict, in y.

In [None]:
y=wine['quality']
wine=wine.drop(['quality','pH','residual_sugar','free_sulfur_dioxide'],axis=1)

In [None]:
wine.head()

**DATA SPLIT**

Now we will split the data into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(wine, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
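
If some quality scores are much rarer than others, a plain random split can leave very few of them in the test set. One option, shown here only as a sketch on made-up data, is to pass `stratify` so that both sets keep the same class proportions, and `random_state` so that the split is reproducible. The names `X_toy` and `y_toy` and the value 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Toy features and imbalanced toy labels (hypothetical, not the wine data)
X_toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_toy = pd.Series([5] * 100 + [6] * 80 + [7] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class counts in the test set follow the 100:80:20 proportions of the labels
test_counts = y_te.value_counts().sort_index()
print(test_counts)
```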

**MODEL FITTING**

Everything is ready, so now we will fit our model on the selected features.

In [None]:
clf = RandomForestClassifier(n_estimators=200)  # n_estimators is the number of trees in the forest
clf.fit(X_train, y_train)
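
One common way to choose `n_estimators` is to try a few values with cross-validation and keep the one with the best score. The sketch below uses `GridSearchCV` on made-up data from `make_classification`; the grid values, `cv=3` and `random_state=0` are arbitrary assumptions, and on the real data we would fit the search on `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy classification data (hypothetical stand-in for the training set)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate numbers of trees to compare
param_grid = {"n_estimators": [10, 50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

More trees usually cost more training time, so it is worth checking whether the score still improves before increasing the number further.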

The prediction on our test set is done here.

In [None]:
pred=clf.predict(X_test)

In [None]:
print(classification_report(y_test, pred))

**RESULT**

We can see our accuracy is **~70%**



In [None]:
acc = clf.score(X_test,y_test)
acc

We can see our confusion matrix here.

In [None]:
print(confusion_matrix(y_test, pred))
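
To read the confusion matrix: each row is a true label, each column is a predicted label, and the diagonal holds the correct predictions. The small sketch below, on made-up labels, shows that the accuracy is simply the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true and predicted quality scores (hypothetical values)
y_true = [5, 5, 6, 6, 7, 7]
y_hat = [5, 6, 6, 6, 7, 5]

cm = confusion_matrix(y_true, y_hat)   # rows: true label, columns: predicted label
acc_from_cm = np.trace(cm) / cm.sum()  # correct predictions / all predictions

print(cm)
print(acc_from_cm, accuracy_score(y_true, y_hat))
```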

**THANK YOU FOR READING MY NOTEBOOK**

**PLEASE UPVOTE SO THAT THIS WILL REACH BEGINNERS AND HELP THEM**

FEEL FREE TO DROP A REVIEW IN COMMENTS