## What are the different types of data issues?

## Incomplete data

* This is data with missing fields.

## Duplicated entries

* Duplicate data is any entry that inadvertently repeats another entry in the dataset, for example a carbon copy of a row or column.

## Invalid data

* Values stored with the wrong data type, e.g. dates kept as text instead of `datetime`.

## Conflicting data

* The same value written in different ways, for instance using different letter cases.
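
As a rough illustration, each of the four issues above can be spotted on a tiny table. The `toy` DataFrame and all of its values are invented for this sketch and are not part of the students dataset.

```python
import pandas as pd

# Hypothetical toy data, invented only to illustrate the four issue types
toy = pd.DataFrame({
    "name": ["Amina", "Brian", "Brian", "Chao"],
    "house": ["Elgon", "nandi", "nandi", "Nandi"],
    "joined": ["2021-01-04", "2021-01-05", "2021-01-05", None],
})

# Incomplete data: count missing values per column
missing_per_column = toy.isna().sum()

# Duplicated entries: rows that repeat an earlier row exactly
duplicate_rows = int(toy.duplicated().sum())

# Invalid data: dates stored as text rather than as a datetime dtype
joined_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["joined"])

# Conflicting data: the same house written in different cases
raw_houses = toy["house"].nunique()
normalised_houses = toy["house"].str.lower().nunique()

print(missing_per_column)
print(duplicate_rows, joined_is_datetime, raw_houses, normalised_houses)
```

Lower-casing the house names collapses `nandi` and `Nandi` into one value, which is why the normalised count is smaller than the raw one.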

## Import Python Libraries

In [3]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## Load the data

In [5]:
# index_col=0 uses the first column of the file as the row index
data = pd.read_csv("../Data Analysis/Data/students_data.csv", index_col = 0)

## Understanding the data

In [6]:
# preview of first 5 rows
data.head()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%


In [7]:
# preview of last 5 rows
data.tail()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
142,"TIMOTHY NDEDA, OBURA",13322634.0,Elgon,0.0,-78.0,40.0,99.0,70.0,49.0,99&,92&
143,"MUKUHA JERIEL, NGARA",1932845.0,Cherangani,321.0,94.0,780.0,420.0,71.0,88.0,56%,76%
144,"JOB, KAMAU",1430232.0,Nandi,43200.0,98.0,80.0,86.0,64.0,99.0,49%,69%
145,"CHEGE, KAMAU",159.0,Nandi,,508.0,409.0,77.0,58.0,56.0,88%,84%
146,"RAMADHAN, MUSA",87.0,Cherangani,,81.0,70.0,64.0,680.0,88.0,76%,72%


In [8]:
# column names
data.columns

Index(['names', 'admission number', 'house', 'balance', 'english', 'kiswahili',
       'mathematics', 'science', 'sst/cre', 'Creative Arts', 'music'],
      dtype='object')

In [9]:
# shape
data.shape

(147, 11)

In [10]:
# overview of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 146
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   names             147 non-null    object 
 1   admission number  124 non-null    float64
 2   house             26 non-null     object 
 3   balance           58 non-null     object 
 4   english           121 non-null    float64
 5   kiswahili         119 non-null    float64
 6   mathematics       130 non-null    float64
 7   science           117 non-null    float64
 8   sst/cre           132 non-null    float64
 9   Creative Arts     143 non-null    object 
 10  music             147 non-null    object 
dtypes: float64(6), object(5)
memory usage: 13.8+ KB


## Duplicates and unwanted observations

In [11]:
# checking for duplicates
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
142    False
143    False
144    False
145    False
146    False
Length: 147, dtype: bool

In [12]:
# return true if the data has duplicates
data.duplicated().any()

True

In [13]:
# count the True values; no_true stores the number of duplicated rows
no_true = 0

# loop through the boolean Series, where True is duplicated and False is not duplicated
for val in data.duplicated():
    if val:
        # increment the no. of True values by 1 upon finding a duplicate
        no_true += 1
print(no_true)
print(f"{np.round((no_true/len(data)), 5) *100} %")

8
5.442 %


In [14]:
# fraction of rows that are duplicated
percentage = no_true/len(data)
print(percentage)

0.05442176870748299


In [15]:
# convert the fraction into a percentage
conv_per = percentage*100
print(conv_per)

5.442176870748299


In [16]:
# round off to 4 decimal places
round_off = np.round(conv_per, 4)
print(round_off)

5.4422


In [17]:
# display the percentage
print(f"{round_off}% of the data is duplicated")

5.4422% of the data is duplicated
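
The loop and the step-by-step percentage above can also be written with vectorised Series methods, because `True` counts as 1 when a boolean Series is summed. A minimal sketch on a made-up frame follows; `demo` is hypothetical, and with the notebook's data the mask would be `data.duplicated()`.

```python
import pandas as pd

# Hypothetical frame with one repeated row
demo = pd.DataFrame({
    "names": ["A", "B", "B", "C"],
    "english": [81.0, 85.0, 85.0, 54.0],
})

# Boolean Series: True marks a row that repeats an earlier row
dup_mask = demo.duplicated()

# Summing booleans counts the True values
n_duplicates = int(dup_mask.sum())

# The mean of booleans is the fraction of True values
pct_duplicates = round(float(dup_mask.mean()) * 100, 4)

print(f"{n_duplicates} duplicated rows ({pct_duplicates}% of the data)")
```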


## Removing Duplicates

In [18]:
# subset: columns to compare; None compares all columns, which is what we want here
# keep: "first" keeps the first occurrence, "last" keeps the last
data.drop_duplicates(subset = None, keep = "first", inplace = True)

In [35]:
# once you drop duplicates reset the index
data.reset_index(drop = True, inplace = True)

In [36]:
data.duplicated().any()

False
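
Dropping duplicates and resetting the index can also be done in one call that returns a new frame instead of modifying `data` in place. This is a sketch on made-up rows (`demo` is hypothetical) and assumes pandas 1.0 or later, where `drop_duplicates` accepts `ignore_index`.

```python
import pandas as pd

# Hypothetical frame with one repeated row
demo = pd.DataFrame({
    "names": ["A", "B", "B", "C"],
    "english": [81.0, 85.0, 85.0, 54.0],
})

# Drop repeated rows, keep the first occurrence, and renumber the index from 0
deduped = demo.drop_duplicates(keep="first", ignore_index=True)

print(deduped)
print(deduped.duplicated().any())
```

Returning a new frame leaves the original untouched, which makes it easier to re-run a cell without side effects.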

## Duplicates in specific columns

* Unique identifiers should not be duplicated.

In [37]:
data.duplicated(subset = ['admission number']).any()

True

In [38]:
admission_col = data.duplicated(subset = ['admission number'])
type(admission_col)

pandas.core.series.Series

In [39]:
# number of rows whose admission number repeats an earlier row
number_true = 0

for val in admission_col:
    if val:
        # increment the no. of True values by 1 upon finding a duplicate
        number_true += 1
print(number_true)

47


In [40]:
# list the shared admissions
shared_admission_index = []

# enumerate yields positions, which match the index labels after reset_index
for index, val in enumerate(admission_col):
    if val:
        shared_admission_index.append(index)

print(shared_admission_index)

[12, 16, 28, 37, 43, 44, 50, 52, 61, 63, 66, 70, 72, 74, 75, 78, 81, 82, 90, 91, 93, 94, 95, 100, 102, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]
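
The same list can be obtained without a loop by filtering the boolean Series with itself. The sketch below uses made-up admission numbers (`demo` is hypothetical) and also shows `keep=False`, which flags every row that shares a number rather than only the later copies.

```python
import pandas as pd

# Hypothetical admission numbers; 13383 appears twice
demo = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "admission number": [13259.0, 13383.0, 13243.0, 13383.0],
})

# True for rows whose number repeats an earlier row
later_copies = demo.duplicated(subset=["admission number"])

# Index labels of the flagged rows
shared_labels = later_copies[later_copies].index.tolist()

# keep=False flags all rows that share a number, including the first one
all_sharing = demo[demo.duplicated(subset=["admission number"], keep=False)]

print(shared_labels)
print(all_sharing)
```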


In [41]:
admission_col.index

RangeIndex(start=0, stop=139, step=1)

In [42]:
admission_col.values

array([False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False,  True,  True,
       False, False, False, False, False,  True, False,  True, False,
       False, False, False, False, False, False, False,  True, False,
        True, False, False,  True, False, False, False,  True, False,
        True, False,  True,  True, False, False,  True, False, False,
        True,  True, False, False, False, False, False, False, False,
        True,  True, False,  True,  True,  True, False, False, False,
       False,  True, False,  True, False, False, False, False, False,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [43]:
# Accessing column admission number by name of the column
data['admission number']

0         13259.0
1         13243.0
2         13307.0
3         13258.0
4         13363.0
          ...    
134    13322634.0
135     1932845.0
136     1430232.0
137         159.0
138          87.0
Name: admission number, Length: 139, dtype: float64

In [45]:
# accessing a value of a series by its index
data['admission number'][137]

159.0

#### Review of loc and iloc

`df.loc[]` and `df.iloc[]`

* Syntax
```
df.loc[row_label, column_label]
df.iloc[row_index, column_index]
```
* `loc` selects rows and columns by label.
* `iloc` selects rows and columns by integer position.
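
The difference is easiest to see when the index labels are not 0, 1, 2. In this sketch the `marks` frame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 20, 30
marks = pd.DataFrame(
    {"english": [81, 85, 54], "science": [30, 49, 59]},
    index=[10, 20, 30],
)

# loc: row labelled 20, column named "science"
by_label = marks.loc[20, "science"]

# iloc: second row, second column (positions start at 0)
by_position = marks.iloc[1, 1]

print(by_label, by_position)
```

Both calls reach the same cell. In the students data the index was reset to 0, 1, 2, ..., so labels and positions happen to coincide for rows.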

In [34]:
# accessing a single element using loc
value = data.loc[137, "admission number"]
print(value)

159.0


In [46]:
# accessing a single element using iloc
value1 = data.iloc[137,1]
value1

159.0

* Syntax
```
df.loc[row_label, start_column_label:end_column_label]
```

To access a single row:

```
df.loc[row_label, :]
```
Leaving both ends of the column slice blank makes pandas use the defaults, i.e. every column from the first to the last.

In [47]:
# select a whole row
data.loc[137, :]

names               CHEGE, KAMAU
admission number             159
house                      Nandi
balance                      NaN
english                      508
kiswahili                    409
mathematics                   77
science                       58
sst/cre                       56
Creative Arts                88%
music                        84%
Name: 137, dtype: object

In [48]:
data.iloc[137,:]

names               CHEGE, KAMAU
admission number             159
house                      Nandi
balance                      NaN
english                      508
kiswahili                    409
mathematics                   77
science                       58
sst/cre                       56
Creative Arts                88%
music                        84%
Name: 137, dtype: object

Selecting multiple rows
```
df.loc[start_row_label:end_row_label, start_column_label:end_column_label]
```

In [52]:
# Selecting multiple rows
top = data.loc[0:7, :]
top

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%
5,"MUENDO CICILY, MUTHEU",13283.0,,,43.0,60.0,48.0,94.0,97.0,69%,45%
6,"IAN MWAI, TOYOTA",13233.0,,23204.0,82.0,68.0,91.0,69.0,81.0,44%,59%
7,"PITA SHEKINAH, WISE",13389.0,,,61.0,81.0,53.0,73.0,83.0,64%,64%


In [50]:
# Selecting multiple rows
data.iloc[0:7, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%
5,"MUENDO CICILY, MUTHEU",13283.0,,,43.0,60.0,48.0,94.0,97.0,69%,45%
6,"IAN MWAI, TOYOTA",13233.0,,23204.0,82.0,68.0,91.0,69.0,81.0,44%,59%
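
Note that `data.loc[0:7, :]` returned eight rows while `data.iloc[0:7, :]` returned seven. Label slices with `loc` include the end label; position slices with `iloc` exclude the end position, like ordinary Python slicing. A small sketch on a made-up frame (`numbers` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a default 0..9 index
numbers = pd.DataFrame({"value": range(10)})

# loc: labels 0 through 7, end label included
label_slice = numbers.loc[0:7, :]

# iloc: positions 0 through 6, end position excluded
position_slice = numbers.iloc[0:7, :]

print(len(label_slice), len(position_slice))
```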


In [53]:
# select a column
data.loc[:, "house"]

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
134         Elgon
135    Cherangani
136         Nandi
137         Nandi
138    Cherangani
Name: house, Length: 139, dtype: object

In [56]:
data.iloc[:, 2]

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
134         Elgon
135    Cherangani
136         Nandi
137         Nandi
138    Cherangani
Name: house, Length: 139, dtype: object

In [57]:
# creating a slice
data.loc[2:7, "house":"science"]

Unnamed: 0,house,balance,english,kiswahili,mathematics,science
2,,,54.0,49.0,53.0,59.0
3,,,71.0,97.0,92.0,41.0
4,,,40.0,84.0,74.0,82.0
5,,,43.0,60.0,48.0,94.0
6,,23204.0,82.0,68.0,91.0,69.0
7,,,61.0,81.0,53.0,73.0


In [58]:
data.iloc[2:8, 2:8]

Unnamed: 0,house,balance,english,kiswahili,mathematics,science
2,,,54.0,49.0,53.0,59.0
3,,,71.0,97.0,92.0,41.0
4,,,40.0,84.0,74.0,82.0
5,,,43.0,60.0,48.0,94.0
6,,23204.0,82.0,68.0,91.0,69.0
7,,,61.0,81.0,53.0,73.0


#### Duplicates in specific columns (continued)

* Unique identifiers should not be duplicated.

In [60]:
print(shared_admission_index)

[12, 16, 28, 37, 43, 44, 50, 52, 61, 63, 66, 70, 72, 74, 75, 78, 81, 82, 90, 91, 93, 94, 95, 100, 102, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]


In [59]:
# access a single row 
data.loc[12,:]

names               ADANGA, JAMES
admission number            13383
house                         NaN
balance                       NaN
english                        61
kiswahili                      59
mathematics                    95
science                        79
sst/cre                        78
Creative Arts                 80%
music                         82%
Name: 12, dtype: object

In [61]:
# compare
data[data["admission number"] == 13383.0]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
11,"PRETIE, ARIONA",13383.0,,8814.0,82.0,30.0,79.0,67.0,98.0,62%,79%
12,"ADANGA, JAMES",13383.0,,,61.0,59.0,95.0,79.0,78.0,80%,82%


In [68]:
data.loc[133,:]

names               William, Okomba
admission number                NaN
house                    Cherangani
balance                       4,141
english                         NaN
kiswahili                       NaN
mathematics                      60
science                         NaN
sst/cre                         NaN
Creative Arts                   70%
music                           70%
Name: 133, dtype: object
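
Row 133 has no admission number at all, yet it appears in `shared_admission_index`. This happens because `duplicated` treats missing values as equal to each other, so every row with a missing number after the first one is flagged. The sketch below separates genuinely shared numbers from missing ones, using made-up values (`demo` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one shared number (13383) and two missing numbers
demo = pd.DataFrame({
    "names": ["A", "B", "C", "D", "E"],
    "admission number": [13383.0, np.nan, 13383.0, np.nan, 13259.0],
})

# Rows flagged by duplicated, missing values included
flagged = demo.duplicated(subset=["admission number"])

# A row is a true duplicate only when its number is actually present
has_number = demo["admission number"].notna()
true_duplicates = flagged & has_number
missing_flagged = flagged & ~has_number

print(int(flagged.sum()), int(true_duplicates.sum()), int(missing_flagged.sum()))
```

Applied to the students data, a split like this would show how many of the 47 flagged rows are real clashes and how many are simply missing admission numbers.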

In [74]:
# changing Pretie's admission number
# Pretie is the first row with this number
# Adanga is the last row with this number
pretie_index = data[data["admission number"] == 13383.0].index[0]

In [75]:
# use loc to change
data.loc[pretie_index, "admission number"] = 15000

In [76]:
data.loc[11:12, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
11,"PRETIE, ARIONA",15000.0,,8814.0,82.0,30.0,79.0,67.0,98.0,62%,79%
12,"ADANGA, JAMES",15000.0,,,61.0,59.0,95.0,79.0,78.0,80%,82%


In [77]:
# both rows show 15000 in the output above; Adanga is the second of them
adanga_index = data[data["admission number"] == 15000].index[1]

In [78]:
data.loc[adanga_index, "admission number"] = 13383.0

In [79]:
data.loc[11:12, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
11,"PRETIE, ARIONA",15000.0,,8814.0,82.0,30.0,79.0,67.0,98.0,62%,79%
12,"ADANGA, JAMES",13383.0,,,61.0,59.0,95.0,79.0,78.0,80%,82%
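
Picking a row by its position in the filtered result (`.index[0]`, `.index[1]`) depends on row order and on which rows currently match. One alternative, sketched here on made-up rows, is a boolean mask that names the row explicitly. The `demo` frame is hypothetical; the names and the new number 15000 are taken from the cells above.

```python
import pandas as pd

# Hypothetical two-row stand-in for rows 11 and 12
demo = pd.DataFrame({
    "names": ["PRETIE, ARIONA", "ADANGA, JAMES"],
    "admission number": [13383.0, 13383.0],
})

# Select exactly one row: the shared number AND the student's name
mask = (demo["admission number"] == 13383.0) & (demo["names"] == "PRETIE, ARIONA")

# Assign the new number only where the mask is True
demo.loc[mask, "admission number"] = 15000.0

print(demo)
print(demo["admission number"].is_unique)
```

Because the mask identifies the student directly, re-running the cell does not change a different row by accident.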
