## What are the different types of data issues?

## Incomplete data

* This is data with missing fields.
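As a minimal sketch of how missing fields can be counted, the example below builds a small hypothetical frame (the values are invented for illustration, not taken from the students file) and counts the missing entries in each column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names loosely mirror the students data set
toy = pd.DataFrame({
    "names": ["A", "B", "C"],
    "house": ["Elgon", None, None],
    "english": [81.0, np.nan, 54.0],
})

# isna() marks every missing cell as True; sum() counts the Trues per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Both `None` and `NaN` are treated as missing here, so `house` reports two missing values and `english` reports one.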

## Duplicated entries

* Duplicate data is any entry that inadvertently shares data with another entry in a database, for example a carbon copy of a row or column.

## Invalid Data

* The wrong datatype is used to represent the data, e.g. a datetime stored as text.
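One way this shows up is a numeric column stored as text. The sketch below assumes a hypothetical column of percentage strings (similar in spirit to the `Creative Arts` and `music` columns) and converts it to numbers. Entries that cannot be parsed become `NaN` because of `errors="coerce"`.

```python
import pandas as pd

# Hypothetical toy column: percentages stored as text, one with a stray "&"
scores = pd.Series(["99%", "38%", "92&"])

# Strip the trailing "%" and convert to numbers; unparseable text becomes NaN
cleaned = pd.to_numeric(scores.str.rstrip("%"), errors="coerce")
print(cleaned)
```

Coercing to `NaN` is only one possible choice. It makes the bad entries easy to find with `isna()`, but it also discards the original text, so it may be worth inspecting those rows first.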

## Conflicting data

* The same thing is represented in different ways, for instance by using different letter cases.
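A small sketch of the case problem, using a hypothetical `houses` Series (the spellings are invented for illustration): three spellings of the same house count as three distinct values until the text is normalized.

```python
import pandas as pd

# Hypothetical toy column with inconsistent case and stray whitespace
houses = pd.Series(["Elgon", "elgon", "ELGON ", "Nandi"])
print(houses.nunique())  # 4 distinct spellings

# Strip whitespace and apply one consistent case
normalized = houses.str.strip().str.title()
print(normalized.nunique())  # 2 distinct houses
```

Title case is an arbitrary choice here; lower case would work equally well as long as it is applied consistently.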

## Import Python Libraries

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## Load the data

In [6]:
# index_col = 0 tells pandas to use the first column of the file as the index
data = pd.read_csv('students_data.csv', index_col = 0)

## Understanding the data

In [7]:
# preview of first 5 rows
data.head()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%


In [8]:
# preview of last 5 rows
data.tail()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
142,"TIMOTHY NDEDA, OBURA",13322634.0,Elgon,0.0,-78.0,40.0,99.0,70.0,49.0,99&,92&
143,"MUKUHA JERIEL, NGARA",1932845.0,Cherangani,321.0,94.0,780.0,420.0,71.0,88.0,56%,76%
144,"JOB, KAMAU",1430232.0,Nandi,43200.0,98.0,80.0,86.0,64.0,99.0,49%,69%
145,"CHEGE, KAMAU",159.0,Nandi,,508.0,409.0,77.0,58.0,56.0,88%,84%
146,"RAMADHAN, MUSA",87.0,Cherangani,,81.0,70.0,64.0,680.0,88.0,76%,72%


In [12]:
# column names
data.columns

Index(['names', 'admission number', 'house', 'balance', 'english', 'kiswahili',
       'mathematics', 'science', 'sst/cre', 'Creative Arts', 'music'],
      dtype='object')

In [13]:
# shape
data.shape

(147, 11)

In [14]:
# overview of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 146
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   names             147 non-null    object 
 1   admission number  124 non-null    float64
 2   house             26 non-null     object 
 3   balance           58 non-null     object 
 4   english           121 non-null    float64
 5   kiswahili         119 non-null    float64
 6   mathematics       130 non-null    float64
 7   science           117 non-null    float64
 8   sst/cre           132 non-null    float64
 9   Creative Arts     143 non-null    object 
 10  music             147 non-null    object 
dtypes: float64(6), object(5)
memory usage: 13.8+ KB


## Duplicates and unwanted observations

In [16]:
# checking for duplicates
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
142    False
143    False
144    False
145    False
146    False
Length: 147, dtype: bool

In [18]:
# return true if the data has duplicates
data.duplicated().any()

True

In [25]:
# how many Trues are there? no_true stores the number of duplicated rows
no_true = 0

# loop through the boolean Series: True marks a duplicated row, False a non-duplicate
for val in data.duplicated():
    if val:
        # increment the count by 1 for each duplicate found
        no_true += 1
print(no_true)
print(f"{np.round((no_true/len(data)), 5) *100} %")

8
5.442 %


In [20]:
# fraction of the rows that are duplicated
percentage = no_true/len(data)
print(percentage)

0.05442176870748299


In [28]:
# convert the fraction into a percentage
conv_per = percentage*100
print(conv_per)

5.442176870748299


In [30]:
# round off to 4 decimal places
round_off = np.round(conv_per, 4)
print(round_off)

5.4422


In [31]:
# display the percentage
print(f"{round_off}% of the data is duplicated")

5.4422% of the data is duplicated
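The count and percentage above can also be computed without an explicit loop, because `True` counts as 1 when a boolean Series is summed or averaged. The sketch below shows this on a hypothetical four-row frame rather than the students data.

```python
import pandas as pd

# Hypothetical toy frame: the third row is an exact copy of the first
toy = pd.DataFrame({"names": ["A", "B", "A", "C"], "english": [81, 54, 81, 40]})

dup_mask = toy.duplicated()               # boolean Series, True for repeated rows
n_dup = int(dup_mask.sum())               # number of duplicated rows
pct_dup = round(dup_mask.mean() * 100, 3) # share of duplicated rows, in percent

print(f"{n_dup} duplicated row(s), {pct_dup}% of the data")
```

Applied to the notebook's frame, `data.duplicated().sum()` would be expected to match the loop's count.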


## Removing Duplicates

In [32]:
# subset = None compares all columns; pass a list of columns to compare only those
# keep = "first" keeps the first occurrence of each duplicate, "last" keeps the last
data.drop_duplicates(subset = None, keep = "first", inplace = True)

In [34]:
data.duplicated().any()

False
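A side effect worth knowing: dropping rows leaves gaps in the index labels. The sketch below, on a hypothetical three-row frame, uses the non-`inplace` form of `drop_duplicates` and then renumbers the index with `reset_index(drop=True)`. Whether to renumber is a design choice; keeping the original labels makes it easier to trace rows back to the source file.

```python
import pandas as pd

# Hypothetical toy frame: row 1 is an exact copy of row 0
toy = pd.DataFrame({"names": ["A", "A", "B"], "english": [81, 81, 54]})

# Returns a new frame and leaves `toy` untouched
deduped = toy.drop_duplicates(keep="first")
print(deduped.index.tolist())  # [0, 2] - label 1 is gone

# Optionally renumber the rows from 0
deduped_reset = deduped.reset_index(drop=True)
print(deduped_reset.index.tolist())  # [0, 1]
```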

## Duplicates in specific columns

* Unique identifiers should not be duplicated.

In [35]:
data.duplicated(subset = ['admission number']).any()

True

In [36]:
admission_col = data.duplicated(subset = ['admission number'])
type(admission_col)

pandas.core.series.Series

In [37]:
# count the rows whose admission number repeats an earlier row
number_true = 0

for val in admission_col:
    if val:
        # increment the count by 1 for each duplicate found
        number_true += 1
print(number_true)

47
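One thing worth checking in a count like this: `duplicated()` treats missing values as equal to each other, so every row with no admission number after the first is flagged as a duplicate. Since `data.info()` reported only 124 non-null admission numbers, some of the flagged rows may be missing values rather than genuinely shared numbers. The sketch below uses a hypothetical toy frame (not the students data) to separate the two cases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one repeated number and two missing numbers
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "admission number": [13259.0, np.nan, 13259.0, np.nan],
})

# Flags the repeated number AND the second missing value
dup_all = toy.duplicated(subset=["admission number"])

# Keep only flags on rows that actually have an admission number
dup_real = dup_all & toy["admission number"].notna()

print(int(dup_all.sum()), int(dup_real.sum()))  # 2 1
```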


In [43]:
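# list the positions of rows whose admission number repeats an earlier row
# note: enumerate gives row positions (0, 1, 2, ...), not the index labels
shared_admission_index = []

for index, val in enumerate(admission_col):
    if val:
        shared_admission_index.append(index)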

print(shared_admission_index)

[12, 16, 28, 37, 43, 44, 50, 52, 61, 63, 66, 70, 72, 74, 75, 78, 81, 82, 90, 91, 93, 94, 95, 100, 102, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]
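Because rows were dropped earlier, the frame's index labels no longer match row positions, and `enumerate` reports positions. The sketch below, on a hypothetical frame with gaps in its index, shows how to get the index labels instead by filtering the boolean Series with itself, and how `keep=False` returns every row that shares an admission number rather than only the later occurrences.

```python
import pandas as pd

# Hypothetical toy frame with gaps in the index, as after drop_duplicates
toy = pd.DataFrame(
    {"admission number": [1, 2, 1, 2]},
    index=[0, 3, 5, 9],
)

mask = toy.duplicated(subset=["admission number"])

positions = [i for i, flagged in enumerate(mask) if flagged]  # row positions
labels = mask[mask].index.tolist()                            # index labels

print(positions)  # [2, 3]
print(labels)     # [5, 9]

# keep=False flags every member of a duplicated group, including the first
shared = toy[toy.duplicated(subset=["admission number"], keep=False)]
print(len(shared))  # 4
```

Labels are what `data.loc[...]` expects, while positions are what `data.iloc[...]` expects, so it helps to be explicit about which one a list holds.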


In [38]:
admission_col.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            137, 138, 139, 140, 141, 142, 143, 144, 145, 146],
           dtype='int64', length=139)

In [39]:
admission_col.values

array([False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False,  True,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False, False, False, False, False,  True, False,
        True, False, False,  True, False, False, False,  True, False,
        True, False,  True,  True, False, False,  True, False, False,
        True,  True, False, False, False, False, False, False, False,
        True,  True, False,  True,  True,  True, False, False, False,
       False,  True, False,  True, False, False, False, False, False,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,