### Data Quality: 
Data quality refers to the accuracy, completeness, consistency, reliability, and timeliness of data. It is a measure of the degree to which data meets the requirements and expectations of its intended use. High-quality data is crucial for making informed decisions, conducting meaningful analysis, and deriving actionable insights. Poor data quality can lead to errors, inefficiencies, and inaccuracies in business operations, decision-making processes, and analytical outcomes. 
### Issues with Data Quality:

1. Incomplete Data: Incomplete data occurs when certain data points are missing or not recorded. This can result in gaps in analysis and decision-making, leading to incomplete insights and inaccurate conclusions.

2. Inaccurate Data: Inaccurate data contains values that do not match the real-world facts they describe. It can arise from data entry mistakes, outdated information, or data integration issues, and it leads to misguided decisions and unreliable analysis.

3. Inconsistent Data: Inconsistent data refers to data that varies in format, structure, or values across different sources or instances. Inconsistencies can make it challenging to perform meaningful analysis and can lead to discrepancies in reporting and decision-making.

4. Duplicate Data: Duplicate data occurs when the same information is recorded multiple times within a dataset. Duplicate records can skew analysis results, inflate metrics, and waste storage space.

5. Outdated Data: Outdated data refers to information that is no longer current or relevant. Using outdated data for analysis or decision-making can lead to inaccurate insights and ineffective strategies.
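
Several of the issues listed above can be counted directly with pandas. The sketch below is only an illustration: the `records` table, its column names, the `quality_report` helper, and the cut-off date are all hypothetical and are not part of the diabetes dataset used later. Inaccurate values are not covered, because detecting them usually needs an external source of truth.

```python
import pandas as pd

# Toy records (hypothetical values, for illustration only).
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "country": ["NP", "np", "np", "US", None],
    "last_updated": pd.to_datetime(
        ["2024-01-10", "2019-06-01", "2019-06-01", "2024-02-15", "2023-12-30"]
    ),
})

def quality_report(df, date_col, stale_before):
    """Return simple counts for incomplete, duplicate and outdated data."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "stale_rows": int((df[date_col] < pd.Timestamp(stale_before)).sum()),
    }

report = quality_report(records, "last_updated", "2020-01-01")

# Inconsistent data: the same country code written in different letter case.
n_raw = records["country"].nunique()
n_normalized = records["country"].str.upper().nunique()

print(report)
print(n_raw, n_normalized)
```

If normalizing a text column reduces its number of distinct values, the column held the same value in more than one format.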

### Missing values and their causes

1. Data Entry Errors: Human error during data entry processes can lead to missing values when information is not accurately recorded or inputted.

2. System Failures: Software or hardware failures can result in missing values if data is not properly saved or recorded due to system errors or malfunctions.

3. Non-response: Missing values may occur when respondents choose not to answer certain questions in surveys or questionnaires, resulting in incomplete data.

4. Data Transformation Issues: Missing values can arise during data transformation processes, such as merging datasets or converting data between formats, due to inconsistencies or mismatches.

5. Privacy Concerns: In order to protect privacy and confidentiality, certain data fields may be intentionally left blank or masked, resulting in missing values in the dataset.
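
As a small illustration of cause 4, the sketch below merges two toy tables whose keys do not fully match. The tables `patients` and `labs` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables: patient 2 has no matching lab record.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [50, 31, 32]})
labs = pd.DataFrame({"patient_id": [1, 3], "glucose": [148, 183]})

# A left join keeps every patient, so the unmatched row gets NaN for glucose.
merged = patients.merge(labs, on="patient_id", how="left")
print(merged)
```

The missing value here does not come from the measurement itself. It appears only because the two sources disagree on which patients exist.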
 
### Techniques to deal with missing values:

1. Interpolation: Interpolation estimates a missing value from the known data points around it, for example by placing it on a straight line between its two neighbors. This method is commonly used for time series data, where missing values are filled in based on the values of neighboring time points.

2. Group Statistics: Group statistics involve filling in missing values with summary statistics (e.g., mean, median, mode) calculated from groups or clusters of similar observations. This approach accounts for variability within different subgroups of the data.

3. Forward or Backward Fill: Forward fill involves filling missing values with the most recent non-missing value in the dataset, while backward fill involves filling missing values with the next non-missing value. This method is useful for data with a temporal or sequential structure.

4. K-nearest Neighbors (KNN) Imputation: KNN imputation involves estimating missing values based on the values of nearest neighbors in the feature space. This method is particularly useful for datasets with complex relationships between variables.

5. Missing Value Indicators: Instead of filling in missing values directly, missing value indicators can be created to flag missing values in the dataset. This approach allows analysts to explicitly account for missingness in their analyses and models.
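
The sketch below applies techniques 1, 2, 3 and 5 to a small toy column with pandas. The data and column names are hypothetical. KNN imputation is left out because it usually relies on an additional library (for example `KNNImputer` in scikit-learn).

```python
import numpy as np
import pandas as pd

# Toy sequential data (hypothetical values).
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, np.nan, 10.0, np.nan, 30.0],
})

# 5. Missing value indicator: record missingness before filling anything.
df["value_missing"] = df["value"].isna()

# 1. Linear interpolation between the neighbouring rows.
df["interpolated"] = df["value"].interpolate(method="linear")

# 2. Group statistics: fill with the median of the row's own group.
group_median = df.groupby("group")["value"].transform("median")
df["group_filled"] = df["value"].fillna(group_median)

# 3. Forward fill (previous value) and backward fill (next value).
df["ffill"] = df["value"].ffill()
df["bfill"] = df["value"].bfill()

print(df)
```

The missing value in row 2 shows how the methods differ. Interpolation gives 6.0 because it uses the next row, which belongs to group "b", while the group median gives 1.5 because it uses group "a" only.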


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'Diabetic']
# load dataset and attach corresponding label to each column of the raw data
dbts_ds= pd.read_csv('C:/Users/acer/nikhil/DataMiningLab/lab/pima_diabetes.csv', header=None, names=col_names)
dbts_ds.head()  # display the first 5 rows of the dataset

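|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |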


## Identify duplicate values in a dataset 
The check returns True for a row that repeats an earlier row and False for a unique row. If two or more rows refer to the same object and all of their attribute values are exactly the same, we can simply remove the duplicated rows.

In [3]:
dbts_ds.duplicated()  # DataFrame.duplicated() flags each row that repeats an earlier row

0      False
1      False
2      False
3      False
4      False
       ...  
763    False
764    False
765    False
766    False
767    False
Length: 768, dtype: bool

> #### Since every row comes back as False, we can say that there are no duplicate rows. Because the printed output is truncated, `dbts_ds.duplicated().any()` can confirm this over all 768 rows.
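
A compact way to count duplicates and remove them, shown on a toy frame (hypothetical values) so that at least one duplicate exists:

```python
import pandas as pd

# Toy frame in which the third row repeats the first one.
toy = pd.DataFrame({"glucose": [148, 85, 148], "bmi": [33.6, 26.6, 33.6]})

# Number of rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

Summing the boolean result gives a single number, so no duplicate can hide inside a truncated printout.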

## Identify Missing values in a dataset

### 2.1.1. First identify where missing values are present in the dataset:

   #### 2.1.1.1. Mark missing values as NaN in the rows or columns of the dataset. Operations such as sum and count ignore NaN values. 
   ****
   In this dataset a value of 0 in the "glucose", "bp", "skin", "insulin" and "bmi" columns stands for a missing measurement, so we use the "replace()" function of the Pandas DataFrame to replace 0 with NaN in those columns.

   #### 2.1.1.2. Then use the "isnull()" function, which marks every NaN value in the dataset as True, and sum the result to count the total number of missing values in each column.


In [4]:
dbts_new = dbts_ds.copy(deep=True)
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
dbts_new[['glucose','bp','skin','insulin','bmi']] = dbts_new[['glucose','bp','skin','insulin','bmi']].replace(0, np.nan)
print("Missing values per column in dbts_new after marking 0 as NaN in the glucose, bp, skin, insulin and bmi columns:")
print(dbts_new.isnull().sum())

Missing values per column in dbts_new after marking 0 as NaN in the glucose, bp, skin, insulin and bmi columns:
pregnant      0
glucose       5
bp           35
skin        227
insulin     374
bmi          11
pedigree      0
age           0
Diabetic      0
dtype: int64
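
Raw counts are easier to judge as a share of all rows. In the output above, 374 of 768 insulin values are missing, which is about 49%. The sketch below computes such percentages on a toy frame. The values are hypothetical, and 0 is treated as a missing-value marker only in the columns named in `sentinel_cols`, which is an assumption of this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "insulin": [0, 0, 0, 94],
    "pregnant": [6, 1, 0, 1],  # 0 pregnancies is a valid value, so it is left alone
})

# Columns in which 0 stands for "not measured" (assumption for this toy data).
sentinel_cols = ["glucose", "insulin"]

marked = toy.copy()
marked[sentinel_cols] = marked[sentinel_cols].replace(0, np.nan)

# isna() gives True/False, so the column mean is the fraction of missing values.
missing_pct = marked.isna().mean() * 100
print(missing_pct)
```

Restricting the replacement to selected columns matters: a 0 in "pregnant" or "Diabetic" is real data and must not be turned into NaN.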


In [5]:
print(dbts_new.head())   # print the first 5 rows of dbts_new, where 0 has been replaced by NaN

   pregnant  glucose    bp  skin  insulin   bmi  pedigree  age  Diabetic
0         6    148.0  72.0  35.0      NaN  33.6     0.627   50         1
1         1     85.0  66.0  29.0      NaN  26.6     0.351   31         0
2         8    183.0  64.0   NaN      NaN  23.3     0.672   32         1
3         1     89.0  66.0  23.0     94.0  28.1     0.167   21         0
4         0    137.0  40.0  35.0    168.0  43.1     2.288   33         1


### 2.2 Handle Missing values 

### 2.2.1. Eliminate rows containing missing values
#### Though this approach is not suitable in many practical cases, it is preferred if only a few rows (each row represents one object in the dataset) have missing values. However, it is impractical to remove rows when a large share of the records contain missing values, because too much data is discarded.

> To eliminate rows with missing values, use the "DataFrame.dropna()" method. It only drops NaN entries, so all missing values must first be replaced by NaN, which was done above for dbts_new and is repeated for dbts_ds in the next cell.

In [6]:
dbts_ds[['glucose','bp','skin','insulin','bmi']] = dbts_ds[['glucose','bp','skin','insulin','bmi']].replace(0, np.nan)
dbts_ds.head()

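|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |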


In [8]:
cln_data  = dbts_ds.dropna()  # eliminate rows containing missing values
cln_data.head()

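|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 13 | 1 | 189.0 | 60.0 | 23.0 | 846.0 | 30.1 | 0.398 | 59 | 1 |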


In [9]:
cln_data.shape  # the dataset now contains only 392 samples

(392, 9)
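
Dropping every incomplete row reduces the data from 768 to 392 samples, so nearly half of the records are lost. `dropna()` also accepts the `subset` and `thresh` parameters, which allow a less aggressive clean-up. The sketch below uses a toy frame with hypothetical values, and the choice of key columns and threshold is only an example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0, 89.0],
    "bmi": [33.6, 26.6, np.nan, 28.1],
    "insulin": [np.nan, 50.0, np.nan, 94.0],
})

# Keep only rows without any NaN (what a plain dropna() does).
all_complete = toy.dropna()

# Require values only in the columns considered essential.
key_complete = toy.dropna(subset=["glucose", "bmi"])

# Keep rows that have at least 2 non-missing values.
mostly_complete = toy.dropna(thresh=2)

print(len(all_complete), len(key_complete), len(mostly_complete))
```

Rows kept by the softer rules still contain NaN, so they need imputation afterwards.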

### 2.2.2. Missing values can be replaced by the mean, median, or quartiles, depending on the type and nature of the attribute values
#### i.e. whether the attribute (column) data is continuous or categorical, or based on the values of similar observed records. It is equally important to take into account how the imputation approach affects the accuracy of the learning algorithm.

*** 
<font color = red>From the histograms of the exploratory data analysis, the "glucose", "bmi" and "skin" features are approximately normally distributed, so we fill their missing elements with the mean, while "insulin" and "bp" are skewed, so we fill their missing elements with the median.</font>
##### Fill in the mean or median value, according to the corresponding histogram distribution, in the missing elements using the fillna() method
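
The same choice can be approximated numerically instead of by reading histograms. The sketch below picks the median when the absolute sample skewness of a column exceeds a threshold and the mean otherwise. The helper `impute_by_skew`, the threshold of 1.0, and the toy data are assumptions made for illustration. They are not the rule used in this lab, which relies on the histograms.

```python
import numpy as np
import pandas as pd

def impute_by_skew(df, columns, skew_threshold=1.0):
    """Fill NaN with the mean for roughly symmetric columns, else with the median."""
    filled = df.copy()
    chosen = {}
    for col in columns:
        # Series.skew() ignores NaN values by default.
        use_median = abs(filled[col].skew()) > skew_threshold
        chosen[col] = "median" if use_median else "mean"
        fill_value = filled[col].median() if use_median else filled[col].mean()
        filled[col] = filled[col].fillna(fill_value)
    return filled, chosen

# Toy data (hypothetical): one symmetric column and one with a large outlier.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, np.nan],
})

filled, chosen = impute_by_skew(toy, ["symmetric", "skewed"])
print(chosen)
print(filled.tail(1))
```

In the skewed column the mean would be 21.2, far from the typical values, while the median of 2.0 is not pulled up by the outlier.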

## Impute missing values through measures of central tendency based on the feature histograms from the lab 1 EDA 

In [10]:
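# mean for the approximately normal features (glucose, skin, bmi), median for the skewed ones (bp, insulin)
dbts_new.fillna({"glucose": dbts_new['glucose'].mean(), "bp": dbts_new['bp'].median(),
                 "skin": dbts_new['skin'].mean(), "insulin": dbts_new['insulin'].median(),
                 "bmi": dbts_new['bmi'].mean()}, inplace=True)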

In [14]:
dbts_new   # show the imputed values that now fill the previously missing entries

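|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.00000 | 125.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.00000 | 125.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | 125.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.00000 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.00000 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76.0 | 48.00000 | 180.0 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122.0 | 70.0 | 27.00000 | 125.0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121.0 | 72.0 | 23.00000 | 112.0 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126.0 | 60.0 | 29.15342 | 125.0 | 30.1 | 0.349 | 47 | 1 |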


## Save cleaned dataset

In [11]:
dbts_new.to_csv('/Users/pramilakhadka/Desktop/imputed_data_diabetes.csv')
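
By default `to_csv()` also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. The sketch below shows the difference with an in-memory buffer instead of a file. The toy frame is hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({"glucose": [148.0, 85.0], "Diabetic": [1, 0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` when saving, or `index_col=0` when loading, keeps the saved dataset at the original nine columns.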