# **Reading a CSV File using Pandas**

In [40]:
import pandas as pd

In [41]:
df = pd.read_csv('../input/diabetes-dataset-trial/diabetes.csv')

In [42]:
df.head()

# **Finding missing values in dataset**
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b

**Standard Missing Values**

In [43]:
df['Pregnancies']

In [44]:
df['Pregnancies'].isnull()

**Non-Standard Missing Values**

In [45]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("../input/diabetes-dataset-trial/diabetes.csv", na_values = missing_values)

In [46]:
print(df['Glucose'])
print(df['Glucose'].isnull())
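
The effect of `na_values` is easier to see on a tiny file. The sketch below is a minimal illustration that assumes a made-up, in-memory CSV (`csv_text` is hypothetical and stands in for `diabetes.csv`). It compares the default parser with one that is given the extra markers.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for diabetes.csv
csv_text = "Glucose,Insulin\n148,0\nn/a,94\n--,168\n89,na\n"

# Default parsing: "n/a" is recognised as missing, "na" and "--" are not
default_df = pd.read_csv(io.StringIO(csv_text))

# Custom markers: every listed string is converted to NaN while reading
missing_values = ["n/a", "na", "--"]
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

print(default_df.isnull().sum())
print(custom_df.isnull().sum())
```

With the extra markers, both columns are parsed as numeric because no stray strings remain.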

**Unexpected Missing Values**

In [47]:
import numpy as np

# Replace entries that cannot be parsed as integers with NaN
for idx, value in df['Insulin'].items():
    try:
        int(value)
    except (ValueError, TypeError):
        df.loc[idx, 'Insulin'] = np.nan

In [48]:
print(df['Insulin'])
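
A vectorised alternative to the loop above is `pd.to_numeric` with `errors="coerce"`. This is a sketch on a hypothetical toy column, not the notebook's data. Note that it is slightly more permissive than `int()`: a string such as "12.5" is kept as a number rather than turned into NaN.

```python
import pandas as pd

# Hypothetical column with a stray text entry among the numbers
insulin = pd.Series(["0", "94", "unknown", "168"], name="Insulin")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
cleaned = pd.to_numeric(insulin, errors="coerce")

print(cleaned)
print(cleaned.isnull().sum())
```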

**Total missing values for each feature**

In [49]:
print(df.isnull().sum())

**Any missing values?**

In [50]:
print(df.isnull().values.any())

**Total number of missing values**

In [51]:
print(df.isnull().sum().sum())
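
The three checks above can also be combined into one summary table. The following sketch assumes a small hypothetical frame (`toy`) and reports the count and percentage of missing values per feature.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
    "Age": [50, 31, 32, 21],
})

# isnull().sum() counts gaps per column; isnull().mean() gives their share
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
}).sort_values("missing", ascending=False)

print(summary)
print("Total missing:", toy.isnull().sum().sum())
```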

# **Dealing with Missing Data**

**Filling in missing values with a single value**

In [52]:
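# Replace missing values with a number
# (assigning the result back avoids in-place fillna on a column selection,
# which recent pandas versions warn about)
df['Age'] = df['Age'].fillna(0)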

**Location based replacement**

In [53]:
df.loc[2,'Age']

In [54]:
df.loc[2,'Age'] = 21

In [55]:
df.loc[2,'Age']

**Replace using median**

In [56]:
median = df['Age'].median()
median

In [57]:
df['Age'] = df['Age'].fillna(median)
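
The order of the fill steps matters. If a column has already been filled with 0, no missing values are left for the median step, and the zeros also pull the median down. The sketch below shows this on a hypothetical `ages` series.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
ages = pd.Series([50.0, np.nan, 32.0, 21.0, np.nan], name="Age")

# median() skips NaN, so this is the median of 21, 32 and 50
median_age = ages.median()
filled = ages.fillna(median_age)

# Filling with 0 first changes the median and leaves nothing to fill
zero_filled = ages.fillna(0)
median_after_zero = zero_filled.median()

print(median_age, median_after_zero)
print(filled.tolist())
```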

# **Calculating Correlation between Attributes**
https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

**Syntax of DataFrame.corr()**

**Syntax:** DataFrame.corr(method='pearson', min_periods=1)

**Parameters:**

**method :**
* pearson: standard correlation coefficient
* kendall: Kendall Tau correlation coefficient
* spearman: Spearman rank correlation

**min_periods :** Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation.

**Returns:** DataFrame containing the correlation matrix

In [58]:
df.corr(method='pearson')
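
To see how `method` and `min_periods` change the result, here is a small sketch on hypothetical data in which `y` is the square of `x`. The relationship is perfectly monotonic but not linear, so the rank-based Spearman coefficient is 1 while Pearson is slightly lower.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Requiring more observations than exist yields NaN instead of a coefficient
strict = toy.corr(method="pearson", min_periods=10)

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
print(strict.loc["x", "y"])
```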

# **Heatmap for the data**
https://www.geeksforgeeks.org/ml-matrix-plots-in-seaborn/

A heatmap is a colour-coded plot of a matrix. The data must therefore be in matrix form: the row index and the column names both need to be meaningful labels, so that each cell holds the value for one (row, column) pair. A correlation matrix already has this shape, with the same feature names on both axes.
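
As an illustration of what matrix form means, the sketch below reshapes hypothetical long-form records (one row per observation) into a labelled matrix with `pivot`. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical long-form records: one row per (month, year) observation
records = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Rows become months, columns become years, cells hold the passenger counts
matrix = records.pivot(index="month", columns="year", values="passengers")

print(matrix)
```

A frame shaped like `matrix` could then be passed to `sns.heatmap`, in the same way as the correlation matrix.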

In [59]:
import seaborn as sns
sns.heatmap(df.corr(method='pearson'))

**Parameters of heatmap()**

* annot: when True, writes the numeric value inside each cell.
* cmap: the colour map to use, such as coolwarm, plasma, or magma.
* linecolor: the colour of the lines separating the cells.
* linewidths: the width of the lines separating the cells.


In [60]:
sns.heatmap(df.corr(method='pearson'), annot=True, cmap='magma', linecolor='black', linewidths=1)

In [61]:
sns.heatmap(df.corr(method='kendall'), annot=True, cmap='magma', linecolor='black', linewidths=1)

In [62]:
sns.heatmap(df.corr(method='spearman'), annot=True, cmap='magma', linecolor='black', linewidths=1)

# **Plotting Visual Map for the given Dataset**

In [63]:
sns.pairplot(df)

**Models**

* Random forest
* Logistic regression
* SVM
* MLP
* Decision tree
* Naive Bayes classifier

**Evaluation metrics**

* MSE
* Confusion matrix
* Accuracy
