
# Data Manipulation Exercises

- Data Cleaning & Preparation Exercises
    - Dealing with Missing & Duplicated Data
    - String Manipulation (Regular Expression)
    - Data Transformation

##### Importing Libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

==========

## Data Cleaning

##### Importing Data

In [None]:
mpg = pd.read_csv('data/mpg-unclean.csv')

##### Inspecting the DataFrame and Identifying the Inconsistent Data

In [None]:
mpg.head()

In [None]:
mpg.info()

In [None]:
mpg.describe().round(2)

In [None]:
mpg.plot(subplots = True, figsize = (15,10))
plt.show()

##### Q. Identify one column label that should be changed and adjust/rename the column label!

In [None]:
mpg.rename(columns = {"model year": "model_year"}, inplace = True)

##### Q. Take a closer look at the origin column by analyzing the frequency/count of unique values! Can you find any inconsistency?

In [None]:
mpg.origin.value_counts()

##### Q. Replace the value "United States" in the origin column with "usa"! Save the change!

In [None]:
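# Assign the result back: an in-place replace on a selected column may not update the DataFrame in newer pandas versions.
mpg['origin'] = mpg['origin'].replace("United States", "usa")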

In [None]:
mpg['origin'].value_counts()

##### Q. Inspect and identify the problem in the column horsepower!

In [None]:
mpg.horsepower.head()

In [None]:
mpg['horsepower'] = mpg['horsepower'].str.replace(" hp", "")

In [None]:
mpg['horsepower'].value_counts().head()

In [None]:
# Create "real" missing values in the column horsepower!
mpg['horsepower'] = mpg['horsepower'].replace("Not available", np.nan)

##### Q. Now you can convert the datatype in the column horsepower! Overwrite the column!

In [None]:
mpg['horsepower'] = mpg['horsepower'].astype("float")
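
As an alternative to replacing the placeholder and then calling `astype`, `pd.to_numeric` with `errors="coerce"` can do both steps at once. The sketch below uses a small made-up Series (`raw_hp` is a hypothetical stand-in, not the real column) to show the idea:

```python
import pandas as pd

# Toy stand-in for the raw horsepower column (illustrative values, not taken from the dataset).
raw_hp = pd.Series(["130 hp", "165 hp", "Not available", "95 hp"])

# Strip the unit suffix, then coerce anything that is not a number to NaN.
hp = pd.to_numeric(raw_hp.str.replace(" hp", "", regex=False), errors="coerce")

print(hp)
print("missing values:", hp.isna().sum())
```

One trade-off of coercion is that every unparseable entry becomes NaN, so unexpected placeholders are not flagged; checking `value_counts()` first, as done above, remains useful.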

In [None]:
mpg.info()

In [None]:
mpg.tail()

##### Q. What about the 'name' column?

In [None]:
mpg['name'].head()

In [None]:
# Convert all names to lowercase and strip leading and trailing whitespace.
mpg['name'] = mpg['name'].str.lower().str.strip()

In [None]:
mpg.head()

In [None]:
mpg.describe().round(2)

##### Q. Inspect the column __model_year__ in more detail by analyzing the __frequency/counts__ of unique values! Anything __strange__?

In [None]:
mpg['model_year'].value_counts()

In [None]:
mpg['model_year'] = mpg['model_year'].replace(1973, 73)

##### Q. Inspect the column weight by sorting the values from high to low. Can you see the extreme value?

In [None]:
mpg['weight'].sort_values(ascending = False)

In [None]:
# Select the complete row of the outlier with the method idxmax()!
mpg.loc[mpg.weight.idxmax()]

In [None]:
# Overwrite the erroneous outlier with the corrected weight!
mpg.loc[mpg.weight.idxmax(), "weight"] = 2300
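
Sorting works well for spotting a single extreme value by eye. When there may be several, a rule of thumb such as the 1.5 × IQR fence can flag candidates programmatically. This is a minimal sketch on made-up weights (`weights` is hypothetical data), and the fence is only one common heuristic, not necessarily the right threshold for this dataset:

```python
import pandas as pd

# Hypothetical weights with one obviously wrong entry.
weights = pd.Series([2130, 2300, 3433, 2900, 23000])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values outside the fences.
outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)
```

Flagged values still need a manual check before they are overwritten or dropped.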

##### Q. Let's check out the column mpg too

In [None]:
mpg['mpg'].sort_values()

In [None]:
# Select the complete row of the outlier with the method idxmin
mpg.loc[mpg['mpg'].idxmin()]

In [None]:
# After some research we have found out that this extreme value is in "gallons per mile" units instead of "miles per gallon". Convert to "miles per gallon" units!
mpg.loc[mpg.mpg.idxmin(), "mpg"] = 1/mpg.loc[mpg.mpg.idxmin(), "mpg"]

##### Q. Select all rows with at least one missing/na value!

In [None]:
mpg.loc[mpg.isna().any(axis = 1)]

In [None]:
mpg.dropna(inplace= True)

##### Q. Find the duplicated records in car names

In [None]:
mpg.duplicated().sum()

In [None]:
mpg.duplicated(subset = ["name"]).sum()

In [None]:
mpg.loc[mpg.duplicated(subset = ["name"], keep = False)].sort_values("name")

In [None]:
mpg.loc[mpg.duplicated(keep = False)].sort_values("name")

In [None]:
mpg.drop_duplicates(inplace = True)
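
The calls above differ only in `subset` and `keep`, which is easy to mix up. The toy frame below (`cars` is made-up data) sketches how each combination marks rows:

```python
import pandas as pd

# Three rows share a name; only the first two are identical in every column.
cars = pd.DataFrame({
    "name": ["ford pinto", "ford pinto", "ford pinto", "honda civic"],
    "mpg": [26.0, 26.0, 23.0, 33.0],
})

full_dupes = cars.duplicated()                                   # repeats of an entire row
name_dupes = cars.duplicated(subset=["name"])                    # repeats of a name, first one kept
all_name_copies = cars.duplicated(subset=["name"], keep=False)   # every row whose name repeats

deduped = cars.drop_duplicates()  # drops only fully identical rows

print(full_dupes.sum(), name_dupes.sum(), all_name_copies.sum(), len(deduped))
```

Because `drop_duplicates()` without `subset` compares all columns, cars that share a name but differ in other values are kept.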

In [None]:
mpg.head()

In [None]:
mpg.info()

##### Q. It's good practice to save the cleaned version of your dataset

In [None]:
mpg.to_csv("data/mpg_clean.csv", index= False)

==========

## String Manipulation (Regular Expression)

##### Check if a String Contains Only Defined Characters using Regex

In [None]:
import re
if re.search(r'^[1234]+$', '2134'):
    print(True)
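
One subtlety of the `^...$` pattern is that `$` also matches just before a trailing newline, so `'2134\n'` would pass the check. A possible alternative is `re.fullmatch`, which requires the whole string to match. The helper name `only_allowed` and its parameters are hypothetical, chosen for this sketch:

```python
import re

def only_allowed(text, allowed="1234"):
    """Return True if text is non-empty and uses only characters from allowed."""
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    pattern = f"[{re.escape(allowed)}]+"
    return re.fullmatch(pattern, text) is not None

print(only_allowed("2134"))
print(only_allowed("2135"))
print(only_allowed("2134\n"))
```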

##### Count Uppercase, Lowercase, and numeric values using Regex

In [None]:
s = 'My name is Mustafa Othman, I am 34 years old'

import re
upper = re.findall(r'[A-Z]', s)
lower = re.findall(r'[a-z]', s)
numeric = re.findall(r'[0-9]', s)

print('The no. of uppercase characters is:', len(upper))
print('The no. of lowercase characters is:', len(lower))
print('The no. of numerical characters is:', len(numeric))

##### Regex to extract maximum numeric value from a string

In [None]:
s = '100khj26io58sgtq1723mnb'
import re
numeric = re.findall(r'\d+', s)
max([int(i) for i in numeric])

##### Remove all characters except letters and numbers

In [None]:
s = "123abcjw:, .@! eiw"
# Use a raw string so that \W is passed to the regex engine unchanged.
re.sub(r'[\W_]+', '', s)

##### Regex to put spaces between words starting with capital letters

In [None]:
results = re.findall('[A-Z][a-z]*', 'MustafaOthmanMustafaEl-Nahas')
' '.join(results)
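
Note that the `findall` approach keeps only the matched letters, so the hyphen in `El-Nahas` is dropped. If the goal is to preserve every original character, one option is to insert a space at each lowercase-to-uppercase boundary with lookarounds. This is a sketch of that alternative, not a replacement for the solution above:

```python
import re

name = 'MustafaOthmanMustafaEl-Nahas'

# Insert a space wherever a lowercase letter is directly followed by an uppercase one.
spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
print(spaced)  # Mustafa Othman Mustafa El-Nahas
```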

==========

# GOOD LUCK!