# What are the different types of data issues?

* Data issues can arise in an organization when collecting data, combining multiple datasets, receiving data from clients, customers or other departments, and entering data manually. Some examples include:


## Incomplete data

* Incomplete data has missing fields or rows: no value is stored for an attribute in an observation. Missing data is common and can significantly affect the insights that can be drawn from a dataset.
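
A minimal sketch of how missing values could be counted per column. The `toy` frame and its values are made up for illustration and are not the students dataset loaded later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values only)
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "english": [81.0, np.nan, 54.0, np.nan],
    "house": [np.nan, np.nan, np.nan, "Nandi"],
})

# number of missing values in each column
missing_counts = toy.isna().sum()

# share of missing values in each column, as a percentage
missing_pct = toy.isna().mean() * 100

print(missing_counts)
print(missing_pct)
```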


## Duplicated entries

* A duplicate is any entry that inadvertently repeats another entry in a database, i.e. a complete carbon copy. Duplicate entries are also common in datasets.


## Invalid data

* Invalid data has attributes that do not conform to the expected schema of the dataset. This includes wrong data types and wrong data formats, which in turn interfere with analysis. For example, a value such as `95%` is read as a string, not as a number.
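
As a rough illustration of the `95%` point, the sketch below turns percentage strings into numbers. The `marks` Series is a made-up example, and using `errors="coerce"` to turn unparseable entries into `NaN` is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical marks stored as text; "99&" is a malformed entry
marks = pd.Series(["99%", "38%", "99&"], name="Creative Arts")

# strip the trailing "%" and convert to numbers;
# anything that still is not a number becomes NaN
numeric_marks = pd.to_numeric(marks.str.rstrip("%"), errors="coerce")

print(numeric_marks)
```

Entries that come out as `NaN` can then be reviewed by hand instead of silently passing through as text.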


## Conflicting data

* Conflicting data occurs when records that are meant to describe the same real-world entity carry different attribute values. Such deviations can mislead any analysis done on the data.
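
One way conflicting records might be detected is to group by an identifier and count the distinct values of another attribute. The sketch below assumes a hypothetical `records` frame in which `admission number` is meant to identify one student.

```python
import pandas as pd

# Hypothetical records: ID 101 appears with two different names
records = pd.DataFrame({
    "admission number": [101, 101, 102, 103, 103],
    "names": ["A", "B", "C", "D", "D"],
})

# count the distinct names recorded against each admission number
names_per_id = records.groupby("admission number")["names"].nunique()

# IDs with more than one distinct name are in conflict
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()

print(conflicting_ids)  # [101]
```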


## Import Python Libraries

In [191]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## Load the data 

In [192]:
# index_col=0 >> use the first column as the index
data = pd.read_csv("Data/students_data.csv", index_col = 0) 

## Understanding the Data

In [193]:
# preview the first five rows 
data.head()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%


data.tail()

In [194]:
# column names 
data.columns

Index(['names', 'admission number', 'house', 'balance', 'english', 'kiswahili',
       'mathematics', 'science', 'sst/cre', 'Creative Arts', 'music'],
      dtype='object')

In [195]:
# shape 
data.shape

(147, 11)

In [196]:
# overview of the data 
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 146
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   names             147 non-null    object 
 1   admission number  124 non-null    float64
 2   house             26 non-null     object 
 3   balance           58 non-null     object 
 4   english           121 non-null    float64
 5   kiswahili         119 non-null    float64
 6   mathematics       130 non-null    float64
 7   science           117 non-null    float64
 8   sst/cre           132 non-null    float64
 9   Creative Arts     143 non-null    object 
 10  music             147 non-null    object 
dtypes: float64(6), object(5)
memory usage: 13.8+ KB


## Duplicates and Unwanted Observations

In [197]:
# duplicates 
data.duplicated().any()

True

In [198]:
# variable to store number of duplicates
no_true = 0


# loop through a bool Series, where True is duplicated and False is not duplicated
for val in data.duplicated():
    if val:
        # increment the number of True values by one upon finding a duplicate
        no_true += 1

print(no_true)
# print(f"{ np.round(((no_true/len(data)) * 100), 4) } %")

8


In [199]:
# Proportion of duplicated rows (a fraction between 0 and 1)
percentage = no_true / len(data)
print(percentage)

0.05442176870748299


In [200]:
# Converting to percentage
conv_per = percentage * 100 
print(conv_per)

5.442176870748299


In [201]:
# round off
round_off = np.round(conv_per, 4)
print(round_off)

5.4422


In [202]:
# display the percentage 
print(f"{round_off}% of the data is duplicated")

5.4422% of the data is duplicated
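
The loop and the step-by-step conversion above can also be written with vectorized operations, since `True` counts as 1 when a boolean Series is summed or averaged. This is a sketch on a made-up `toy` frame, not the notebook's `data`.

```python
import pandas as pd

# Hypothetical toy frame: rows 2 and 4 repeat row 0
toy = pd.DataFrame({
    "names": ["A", "B", "A", "C", "A"],
    "marks": [1, 2, 1, 3, 1],
})

dup_mask = toy.duplicated()

# number of duplicated rows
n_duplicates = int(dup_mask.sum())

# percentage of duplicated rows, rounded to 4 decimal places
pct_duplicates = round(dup_mask.mean() * 100, 4)

print(f"{n_duplicates} duplicated rows, {pct_duplicates}% of the data")
```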


In [203]:
data.drop_duplicates(subset = None, keep = "first", inplace = True)

In [204]:
# once you drop duplicates reset the index
data.reset_index(drop = True, inplace = True)

In [205]:
data.duplicated().any()

False

#### `df.loc[]` and `df.iloc[]`

* Syntax for selecting a single value
```
df.loc[row_label, column_label]
df.iloc[row_index, column_index]
```

* Syntax for selecting a range of columns in one row
```
df.loc[row_label, start_column_label: end_column_label]
```

To access a single row:

```
df.loc[row_label, : ] 
```

Leaving the column slice as `:` with no start or end label makes pandas use the defaults, i.e. the **first** column through the **last**.
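
The difference between labels and positions is easier to see when the index is not `0, 1, 2, ...`. The sketch below uses a hypothetical `toy` frame with index labels `10, 20, 30, 40`. It also shows that a `loc` slice includes its end label while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy frame with non-default index labels
toy = pd.DataFrame(
    {"english": [81, 85, 54, 71], "science": [30, 49, 59, 41]},
    index=[10, 20, 30, 40],
)

# loc uses labels: the row labelled 20, the column named "science"
by_label = toy.loc[20, "science"]

# iloc uses positions: second row, second column (counting from 0)
by_position = toy.iloc[1, 1]

# loc slices include the end label: rows 10, 20 and 30
label_slice = toy.loc[10:30, :]

# iloc slices exclude the end position: rows at positions 0 and 1
position_slice = toy.iloc[0:2, :]

print(by_label, by_position, len(label_slice), len(position_slice))
```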

In [206]:
# accessing column admission number by name of the column
data['admission number']

0         13259.0
1         13243.0
2         13307.0
3         13258.0
4         13363.0
          ...    
134    13322634.0
135     1932845.0
136     1430232.0
137         159.0
138          87.0
Name: admission number, Length: 139, dtype: float64

In [207]:
# accessing a value of a Series by its index label
data['admission number'][137]

159.0

In [208]:
# single element 
value = data.loc[137, "admission number"]
print(value)

159.0


In [209]:
value = data.iloc[137, 1]
value

159.0

In [210]:
# whole row 
data.loc[137, :]

names               CHEGE, KAMAU
admission number           159.0
house                      Nandi
balance                      NaN
english                    508.0
kiswahili                  409.0
mathematics                 77.0
science                     58.0
sst/cre                     56.0
Creative Arts                88%
music                        84%
Name: 137, dtype: object

In [211]:
data.iloc[137, :]

names               CHEGE, KAMAU
admission number           159.0
house                      Nandi
balance                      NaN
english                    508.0
kiswahili                  409.0
mathematics                 77.0
science                     58.0
sst/cre                     56.0
Creative Arts                88%
music                        84%
Name: 137, dtype: object

Selecting multiple rows 

```
df.loc[start_row_label:end_row_label, start_column_label:end_column_label]
```

In [212]:
# selecting multiple rows by label (the end label 7 is included)
data.loc[0:7, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%
5,"MUENDO CICILY, MUTHEU",13283.0,,,43.0,60.0,48.0,94.0,97.0,69%,45%
6,"IAN MWAI, TOYOTA",13233.0,,23204.0,82.0,68.0,91.0,69.0,81.0,44%,59%
7,"PITA SHEKINAH, WISE",13389.0,,,61.0,81.0,53.0,73.0,83.0,64%,64%


In [213]:
# selecting multiple rows by position (the end position 7 is excluded)
data.iloc[0:7, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%
5,"MUENDO CICILY, MUTHEU",13283.0,,,43.0,60.0,48.0,94.0,97.0,69%,45%
6,"IAN MWAI, TOYOTA",13233.0,,23204.0,82.0,68.0,91.0,69.0,81.0,44%,59%


In [214]:
# select a column

data.loc[:, "house"]

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
134         Elgon
135    Cherangani
136         Nandi
137         Nandi
138    Cherangani
Name: house, Length: 139, dtype: object

In [215]:
data.iloc[:, 2]

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
134         Elgon
135    Cherangani
136         Nandi
137         Nandi
138    Cherangani
Name: house, Length: 139, dtype: object

In [216]:
# creating a slice
data.loc[2:7, "house":"science"]

Unnamed: 0,house,balance,english,kiswahili,mathematics,science
2,,,54.0,49.0,53.0,59.0
3,,,71.0,97.0,92.0,41.0
4,,,40.0,84.0,74.0,82.0
5,,,43.0,60.0,48.0,94.0
6,,23204.0,82.0,68.0,91.0,69.0
7,,,61.0,81.0,53.0,73.0


In [217]:
data.iloc[2:8, 2:8] 

Unnamed: 0,house,balance,english,kiswahili,mathematics,science
2,,,54.0,49.0,53.0,59.0
3,,,71.0,97.0,92.0,41.0
4,,,40.0,84.0,74.0,82.0
5,,,43.0,60.0,48.0,94.0
6,,23204.0,82.0,68.0,91.0,69.0
7,,,61.0,81.0,53.0,73.0


### Duplicates in Specific Columns

* Unique identifiers should not be duplicated.

In [218]:
data.shape

(139, 11)

In [219]:
data.duplicated(subset = ['admission number']).any()

True

In [220]:
admission_col = data.duplicated(subset = ['admission number'])
type(admission_col)

pandas.core.series.Series

In [221]:
# counter for the number of True values
# True indicates a duplicated admission number
no_true = 0

# loop through the Series of booleans
# True means the value has already appeared in an earlier row
# False means this is the first occurrence of the value
for val in admission_col:
    if val:
        no_true += 1
        
print(no_true)

47


In [222]:
shared_admission_index = []

for index, val in enumerate(admission_col):
    if val:
        shared_admission_index.append(index)
        
print(shared_admission_index)

[12, 16, 28, 37, 43, 44, 50, 52, 61, 63, 66, 70, 72, 74, 75, 78, 81, 82, 90, 91, 93, 94, 95, 100, 102, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]
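
The same list can be obtained without a loop by indexing `data.index` with the boolean mask. Note that `enumerate` yields positions, which match the row labels here only because the index was reset earlier. Also, `duplicated()` treats missing values as equal to each other, so rows with no admission number are flagged as duplicates; this may account for part of the count above. The sketch below uses a made-up `toy` frame to show both points.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: ID 1.0 is repeated, two IDs are missing
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D", "E"],
    "admission number": [1.0, 2.0, 1.0, np.nan, np.nan],
})

mask = toy.duplicated(subset=["admission number"])

# row labels flagged as duplicates (the second NaN is included)
dup_index = toy.index[mask].tolist()

# keep only rows where a real admission number is repeated
real_mask = mask & toy["admission number"].notna()
real_dup_index = toy.index[real_mask].tolist()

print(dup_index)       # [2, 4]
print(real_dup_index)  # [2]
```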


In [223]:
# access a single row 
data.loc[12, :]

names               ADANGA, JAMES
admission number          13383.0
house                         NaN
balance                       NaN
english                      61.0
kiswahili                    59.0
mathematics                  95.0
science                      79.0
sst/cre                      78.0
Creative Arts                 80%
music                         82%
Name: 12, dtype: object

In [224]:
# compare 
data[data["admission number"] == 13383.0]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
11,"PRETIE, ARIONA",13383.0,,8814.0,82.0,30.0,79.0,67.0,98.0,62%,79%
12,"ADANGA, JAMES",13383.0,,,61.0,59.0,95.0,79.0,78.0,80%,82%


In [225]:
# changing Adanga's admission number
# Pretie is the first occurrence of 13383.0
# Adanga is the second occurrence (the duplicate)
# filter the rows by the value of the column "admission number",
# then use .index[1] to get the row label of the second match
# rem >> we already know Pretie and James Adanga share an admission number
adanga_index = data[data["admission number"] == 13383.0].index[1]

In [226]:
# use loc to change 
data.loc[adanga_index, "admission number"] = 15000

In [227]:
data.loc[11:12, :]

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
11,"PRETIE, ARIONA",13383.0,,8814.0,82.0,30.0,79.0,67.0,98.0,62%,79%
12,"ADANGA, JAMES",15000.0,,,61.0,59.0,95.0,79.0,78.0,80%,82%


### Removing unwanted observations

In [228]:
# removing columns
data.drop(columns=['house'], inplace=True)

In [229]:
# alternative using axis=1 (the column names here are placeholders, not columns of this dataset)
# data.drop(["date", "month", "year"], axis=1, inplace=True)

In [230]:
data.head()

Unnamed: 0,names,admission number,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,40.0,84.0,74.0,82.0,89.0,64%,46%


In [231]:
# remove rows
data.drop(index=[129, 130, 131] , inplace=True)

In [232]:
data.tail(10)

Unnamed: 0,names,admission number,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
126,"Roselynn Kamau, Rose",,5907.0,,,89.0,,48.0,42%,93%
127,"Samuel Gitahi, Mwangi",,,,,87.0,,,96%,79%
128,"Samuel, Kadima",,,,,76.0,,77.0,61%,79%
132,"Walter, Wanami",,,,,68.0,,56.0,70%,96%
133,"William, Okomba",,4141.0,,,60.0,,,70%,70%
134,"TIMOTHY NDEDA, OBURA",13322634.0,0.0,-78.0,40.0,99.0,70.0,49.0,99&,92&
135,"MUKUHA JERIEL, NGARA",1932845.0,321.0,94.0,780.0,420.0,71.0,88.0,56%,76%
136,"JOB, KAMAU",1430232.0,43200.0,98.0,80.0,86.0,64.0,99.0,49%,69%
137,"CHEGE, KAMAU",159.0,,508.0,409.0,77.0,58.0,56.0,88%,84%
138,"RAMADHAN, MUSA",87.0,,81.0,70.0,64.0,680.0,88.0,76%,72%


In [233]:
# drop rows 134 to 137 (the end of the range, 138, is excluded)
data.drop(index=range(134, 138), inplace=True)
data.tail()

Unnamed: 0,names,admission number,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
127,"Samuel Gitahi, Mwangi",,,,,87.0,,,96%,79%
128,"Samuel, Kadima",,,,,76.0,,77.0,61%,79%
132,"Walter, Wanami",,,,,68.0,,56.0,70%,96%
133,"William, Okomba",,4141.0,,,60.0,,,70%,70%
138,"RAMADHAN, MUSA",87.0,,81.0,70.0,64.0,680.0,88.0,76%,72%


In [236]:
data.reset_index(drop=True, inplace=True)
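
Hard-coded row labels work for a one-off cleanup, but rows can also be dropped by a condition. The sketch below is an illustration on a made-up `toy` frame, and it assumes that valid marks lie between 0 and 100, which the notebook does not state.

```python
import pandas as pd

# Hypothetical toy frame with two implausible marks
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "english": [81.0, 508.0, -78.0, 40.0],
})

# assumption: a valid mark is between 0 and 100 inclusive
out_of_range = ~toy["english"].between(0, 100)

# drop the flagged rows, then reset the index as in the cells above
cleaned = toy.drop(index=toy.index[out_of_range]).reset_index(drop=True)

print(cleaned)
```

Because `drop` is called without `inplace=True`, `toy` is left unchanged and the result is returned as a new frame.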