### **Data Quality**
Data quality refers to the accuracy, consistency, reliability, and relevance of data for its intended use. High-quality data should be complete, precise, timely, and appropriately formatted for analysis or decision-making.

### **Issues with Data Quality:**
1. Inaccuracies: Errors in data, such as incorrect values or mislabeling.

2. Inconsistency: Data not adhering to a uniform format or standard across different sources.

3. Incomplete data: Missing data points that leave gaps in the dataset.

4. Duplicates: Multiple records representing the same entity.

5. Outliers: Data points that are significantly different from others and may skew analysis.

6. Poor timeliness: Data that is outdated or no longer relevant to current conditions.
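
Several of the issues above can be checked programmatically. The following is a minimal sketch on a made-up toy table (the column names and values are hypothetical, not taken from the lab dataset). It covers missing values, duplicates, inconsistent formatting, and outliers; inaccuracies and timeliness usually need domain knowledge or timestamps and are not shown. The 1.5 × IQR rule used for outliers is one common convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the names and values are made up for illustration.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dev", "Eli"],
    "age": [34.0, 29.0, 29.0, np.nan, 31.0, 240.0],  # 240 is an implausible age
    "city": ["Pune", "pune", "pune", "Delhi", "Delhi", "Pune"],  # mixed casing
})

# Incomplete data: number of missing values in each column
missing_per_column = toy.isnull().sum()

# Duplicates: rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

# Inconsistency: the same city written with different casing
city_spellings = toy["city"].nunique()
city_after_normalising = toy["city"].str.lower().nunique()

# Outliers: values more than 1.5 * IQR outside the quartiles
q1, q3 = toy["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["age"] < q1 - 1.5 * iqr) | (toy["age"] > q3 + 1.5 * iqr)
outlier_ages = toy.loc[outlier_mask, "age"].tolist()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("city spellings before/after lower-casing:", city_spellings, city_after_normalising)
print("outlier ages:", outlier_ages)
```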

### **Missing values and their reasons**

Missing values refer to the absence of data in one or more attributes (columns). Reasons for missing values can include:

1. Non-response: In surveys or questionnaires, some responses might be skipped.

2. Errors in data collection: Data might not be recorded properly due to technical issues.

3. Data entry mistakes: Human errors during data entry can leave values blank.

4. System limitations: Certain systems might not collect all attributes for every data point.

5. Not applicable: In some cases, a value may not apply to all records (e.g., age for a specific category).

### **Techniques to deal with missing values**:
1. Ignoring Tuple:
The record (tuple) that contains a missing value is skipped or dropped for the operation or analysis at hand. This is reasonable when only a small share of records is affected or when the missing attribute is irrelevant to the analysis.

2. Filling Manually:
Manual filling refers to the process of replacing missing values with specific, often domain-specific, values that you manually provide based on your knowledge or business rules.

3. Using Global Constant:
A global constant replaces every missing value in an attribute with the same fixed placeholder, such as "0" or "unknown". A related option uses a summary statistic of the attribute (mean, median, or mode) as the fill value.

4. Using Most Probable Value:
This refers to filling missing values by determining the most likely or probable value based on the relationships in the dataset, often using regression models or Bayesian inference. It predicts missing data based on other available features.
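
The four techniques can be sketched in pandas as follows. This is an illustrative example on a hypothetical toy table (the columns `years_exp`, `income`, and `dept` are made up), and the "most probable value" step uses a simple least-squares line from `np.polyfit` as a stand-in for the regression or Bayesian models mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" grows linearly with "years_exp".
df = pd.DataFrame({
    "years_exp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income": [30.0, 40.0, np.nan, 60.0, 70.0],
    "dept": ["ops", None, "ops", "lab", "lab"],
})

# 1. Ignoring the tuple: drop every row that has any missing value
ignored = df.dropna()

# 2. Filling manually: supply a value chosen from domain knowledge
manual = df.copy()
manual.loc[1, "dept"] = "ops"  # assumed to be known from another source

# 3. Global constant: one placeholder for every gap in a column
constant = df.fillna({"dept": "unknown", "income": 0.0})

# 4. Most probable value: predict income from years_exp with a fitted line
known = df.dropna(subset=["income"])
slope, intercept = np.polyfit(known["years_exp"], known["income"], deg=1)
probable = df.copy()
gap = probable["income"].isnull()
probable.loc[gap, "income"] = slope * probable.loc[gap, "years_exp"] + intercept

print(ignored)
print(manual)
print(constant)
print(probable)
```

Note how the constant fill puts an income of 0 into a column whose other values lie between 30 and 70, while the fitted line produces a value in line with the neighbouring rows. This is why a placeholder constant is usually safer for categorical attributes than for numeric ones.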




In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'Diabetic']
# load the dataset and attach the corresponding label to each column of the raw data
dbts_ds = pd.read_csv('/content/pima_diabetes.csv', header=0, names=col_names)
dbts_ds.head()  # display the first 5 rows of the dataset

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0


## **Identify duplicate values in a dataset.**
`duplicated()` returns True for a duplicate row and False for a unique row. If two or more rows refer to the same object and their attribute values are exactly the same, we can simply remove the duplicated rows.

In [3]:
dbts_ds.duplicated()  # DataFrame.duplicated() flags each row that repeats an earlier row

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
762,False
763,False
764,False
765,False


#### Since every row is marked False, we can say that there are no duplicate rows in the dataset
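
Because the printed output is truncated in the middle, counting the True values is a more reliable check than reading the table. A minimal sketch on a hypothetical toy frame (not the lab dataset), which also shows how duplicated rows could be removed with `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical toy frame: the row at index 2 repeats the row at index 0
toy = pd.DataFrame({
    "glucose": [85, 183, 85, 137],
    "bp": [66, 64, 66, 40],
})

n_duplicates = int(toy.duplicated().sum())     # each True counts as 1
has_duplicates = bool(toy.duplicated().any())  # True if at least one row repeats

# Keep the first occurrence of each row and drop the later repeats
deduplicated = toy.drop_duplicates(keep="first")

print("duplicate rows:", n_duplicates)
print(deduplicated)
```

Applied to the lab data, the same check would be `dbts_ds.duplicated().sum()`, which should be 0 if no row is repeated.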

## **Identify Missing values in a dataset**

### 2.1.1. First identify where missing values occur in the dataset:

#### 2.1.1.1. Mark missing values as "NaN". Operations such as sum and count ignore NaN values.

In this dataset a missing measurement is recorded as "0" in the columns glucose, bp, skin, insulin, and bmi. Using the "replace()" function of the Pandas DataFrame, we replace "0" with "NaN" in each of these columns.

#### 2.1.1.2. Then we can use the "isnull()" function, which marks every "NaN" value in the dataset as True, and sum the result to count the total number of missing values in each column.


In [4]:
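dbts_new = dbts_ds.copy(deep=True)  # work on a copy so that dbts_ds stays unchanged
# a zero in these columns stands for a missing measurement, so mark it as NaN
dbts_new[['glucose','bp','skin','insulin','bmi']] = dbts_new[['glucose','bp','skin','insulin','bmi']].replace(0, np.nan)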
print("Missing values per column after marking zeros in glucose, bp, skin, insulin, and bmi as NaN:")
print(dbts_new.isnull().sum())

Missing values per column after marking zeros in glucose, bp, skin, insulin, and bmi as NaN:
pregnant      0
glucose       5
bp           35
skin        227
insulin     373
bmi          11
pedigree      0
age           0
Diabetic      0
dtype: int64


In [5]:
print(dbts_new.head())   # first 5 rows of the copy, with the former zeros now shown as NaN

   pregnant  glucose    bp  skin  insulin   bmi  pedigree  age  Diabetic
0         1     85.0  66.0  29.0      NaN  26.6     0.351   31         0
1         8    183.0  64.0   NaN      NaN  23.3     0.672   32         1
2         1     89.0  66.0  23.0     94.0  28.1     0.167   21         0
3         0    137.0  40.0  35.0    168.0  43.1     2.288   33         1
4         5    116.0  74.0   NaN      NaN  25.6     0.201   30         0


### **2.2. Handle missing values**

There are many ways to deal with missing values. Whatever approach we take, the results should be as close as possible to those we would have obtained if the real values had been present.

### **2.2.1. Eliminate rows containing missing values**
#### This approach is unsuitable in many practical cases, but it is acceptable when only a few rows (each row represents one object in the dataset) have missing values. It is impractical when a large share of the records contain missing values, because too much data would be discarded.

> To eliminate rows with missing values, use the "dropna()" method. It only removes values that pandas recognises as missing, so the zeros must be replaced by "NaN" first. We did this above for the copy dbts_new; the next cell does the same for dbts_ds.

In [6]:
dbts_ds[['glucose','bp','skin','insulin','bmi']] = dbts_ds[['glucose','bp','skin','insulin','bmi']].replace(0, np.nan)
dbts_ds.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,85.0,66.0,29.0,,26.6,0.351,31,0
1,8,183.0,64.0,,,23.3,0.672,32,1
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
4,5,116.0,74.0,,,25.6,0.201,30,0


In [7]:
cln_data  = dbts_ds.dropna()  # eliminate rows containing missing values
cln_data.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
12,1,189.0,60.0,23.0,846.0,30.1,0.398,59,1


In [8]:
cln_data.shape  # the dataset now contains only 392 rows

(392, 9)
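
When dropping every incomplete row discards too much data, `dropna()` can be made less strict. The sketch below uses a hypothetical toy frame (not the lab dataset) to compare the default behaviour with the `subset` and `thresh` parameters, and computes the fraction of rows that survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the gaps concentrated in "insulin"
toy = pd.DataFrame({
    "glucose": [85.0, 183.0, np.nan, 137.0],
    "insulin": [np.nan, np.nan, 94.0, 168.0],
    "bmi": [26.6, 23.3, 28.1, 43.1],
})

rows_before = len(toy)
strict = toy.dropna()                       # drop a row if any value is missing
by_column = toy.dropna(subset=["glucose"])  # only require glucose to be present
by_count = toy.dropna(thresh=2)             # keep rows with at least 2 non-missing values

fraction_kept = len(strict) / rows_before

print("rows kept (strict, by_column, by_count):", len(strict), len(by_column), len(by_count))
print("fraction kept by strict dropna:", fraction_kept)
```

Comparing `len()` before and after the drop, as done here, is a quick way to judge whether row elimination is affordable for a given dataset.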

### **2.2.2. Missing values can be replaced by the mean, median, or quartiles, depending on the type and nature of the attribute values**
#### The choice depends on whether the column data is continuous or categorical and on how the observed values are distributed. It is equally important to consider how the imputation approach affects the accuracy of the learning algorithm.

***
<font color = red>From the histograms of the exploratory data analysis, the "glucose", "bmi", and "skin" features are approximately normally distributed, so we fill their missing entries with the mean, while "insulin" and "bp" are skewed, so we fill their missing entries with the median, which is less affected by extreme values.  </font>
##### Fill the missing entries with the mean or median, according to the corresponding histogram distribution, using the fillna() method
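
The mean-or-median choice can also be made numerically instead of by reading histograms. The sketch below is an illustration on hypothetical toy columns, and the cut-off `SKEW_LIMIT = 1.0` is an assumed rule of thumb rather than a value taken from this lab. It uses `Series.skew()` to pick the fill value for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: "symmetric" is balanced, "skewed" has a long right tail
toy = pd.DataFrame({
    "symmetric": [10.0, 20.0, 30.0, 40.0, 50.0, np.nan],
    "skewed": [1.0, 2.0, 2.0, 3.0, 100.0, np.nan],
})

SKEW_LIMIT = 1.0  # assumed rule-of-thumb cut-off, not a value from the lab

fill_values = {}
for column in toy.columns:
    skewness = toy[column].skew()  # NaN entries are skipped by default
    if abs(skewness) > SKEW_LIMIT:
        fill_values[column] = toy[column].median()  # robust to the long tail
    else:
        fill_values[column] = toy[column].mean()

imputed = toy.fillna(fill_values)

print(fill_values)
print(imputed)
```

In the skewed column the single value 100 pulls the mean far above the typical values, whereas the median stays among them, which is the reason for preferring the median there.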

## **Impute missing values with measures of central tendency, based on the feature histograms from the Lab 1 EDA**

In [9]:
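# mean for the normally distributed features (glucose, skin, bmi), median for the skewed ones (bp, insulin)
dbts_new.fillna({"glucose": dbts_new['glucose'].mean(), "bp": dbts_new['bp'].median(),
                 "skin": dbts_new['skin'].mean(), "insulin": dbts_new['insulin'].median(),
                 "bmi": dbts_new['bmi'].mean()}, inplace=True)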

In [10]:
dbts_new   # show the imputed values in the places that were previously missing

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,Diabetic
0,1,85.0,66.0,29.000000,125.0,26.6,0.351,31,0
1,8,183.0,64.0,29.142593,125.0,23.3,0.672,32,1
2,1,89.0,66.0,23.000000,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.000000,168.0,43.1,2.288,33,1
4,5,116.0,74.0,29.142593,125.0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
762,10,101.0,76.0,48.000000,180.0,32.9,0.171,63,0
763,2,122.0,70.0,27.000000,125.0,36.8,0.340,27,0
764,5,121.0,72.0,23.000000,112.0,26.2,0.245,30,0
765,1,126.0,60.0,29.142593,125.0,30.1,0.349,47,1


## Save cleaned dataset

In [11]:
dbts_new.to_csv('/content/cleaned_pima_diabetes.csv')
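
By default `to_csv()` also writes the row index, which comes back as an extra column named `Unnamed: 0` when the file is read again. The sketch below shows the difference that `index=False` makes. It uses a hypothetical two-row frame and in-memory buffers instead of the lab's file path, so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
cleaned = pd.DataFrame({"glucose": [85.0, 183.0], "Diabetic": [0, 1]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
cleaned.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
cleaned.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

columns_with_index = list(reloaded_with_index.columns)
columns_without_index = list(reloaded_without_index.columns)

print(columns_with_index)
print(columns_without_index)
```

Whether to keep the index depends on whether it carries meaning; for a default 0, 1, 2, ... index, leaving it out keeps the saved file aligned with the original column layout.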