In data science and machine learning, data cleaning is one of the most important tasks when working with large amounts of data. In this assignment, you'll see how to clean a dataset step by step using pandas. The dataset simulates a customer list.

**Q1 What is machine learning? Where and why would you use machine learning?**

Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. In other words, ML is a field of AI consisting of learning algorithms that improve their performance (P) at executing some task (T) over time with experience (E). That is, a program is said to learn if its performance at T, as measured by P, improves with E. Machine learning is a subset of artificial intelligence focused on building systems that can learn from historical data, identify patterns, and make logical decisions with little to no human intervention.


Machine learning is used on OTT platforms like Netflix for recommendation systems. Netflix considers users' viewing habits and interests to generate recommendations. Because the system compiles and recommends content based on their preferences, users can take charge of their streaming and customize their interactions. It adapts to and learns from each user's interactions, leading to a more personalized and engaging experience. Netflix has a vast and constantly growing library of movies and TV shows, and machine learning helps users navigate this extensive catalog by surfacing content that aligns with their specific interests. Netflix also wants users to discover new and diverse content, so its algorithms can recommend not only popular titles but also hidden gems that match a user's unique preferences.

**Q2 What is normalization/scaling in machine learning and why do you perform it? Explain with examples.**

Normalization in machine learning is the process of translating data into the range [0, 1] (or any other range), or simply transforming data onto the unit sphere. The goal of normalization is to bring the values of numeric columns in the dataset to a common scale without distorting differences in the ranges of values.
Normalization can substantially improve the accuracy of models that are sensitive to feature scale. It gives equal weight to each variable, so that no single variable steers model performance in one direction just because its values are larger.

Examples of Normalization:

Consider an age feature that ranges from 0 to 90 and a salary feature that ranges from $0 to $500,000,000. Without scaling, the optimization algorithm used in the ML model may take a very long time to find appropriate weights for each feature, if it finds them at all. Moreover, the algorithm may give too much weight to the features with large scales.
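
The age and salary example can be sketched with min-max scaling, one common way to map each column onto [0, 1]. The toy values and the helper name `min_max_scale` below are assumptions made for illustration; they are not part of the assignment dataset.

```python
import pandas as pd

# Hypothetical toy data: ages in years, salaries in dollars.
df = pd.DataFrame({
    "age": [18, 30, 45, 60, 90],
    "salary": [20_000, 55_000, 120_000, 300_000, 1_000_000],
})

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column linearly onto [0, 1]."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column cannot be rescaled; map every value to 0.0.
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / span

# apply() runs the helper once per column.
scaled = df.apply(min_max_scale)
print(scaled.round(3))
```

After scaling, both columns span the same range, so neither dominates a distance or gradient computation simply because of its units.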

Currency Conversion: When dealing with international finance, you might normalize different currencies into a single currency, such as the US dollar, for easier comparison and analysis.


**Q3 What is supervised and unsupervised learning? Give some examples** 

Supervised Learning:

Supervised learning algorithms are the most commonly used ML algorithms. During training, such an algorithm takes the data samples (the training data) together with the output associated with each sample (its label or response).
The main objective of supervised learning algorithms is to learn an association between input data samples and their corresponding outputs from many training instances.

Supervised learning models the relationship between the inputs and their corresponding outputs in the training data, so that it can predict output responses for new inputs based on the relationships and mappings it learned earlier.
For example, we have
x: Input variables and
Y: Output variable
Now, apply an algorithm to learn the mapping function from the input to the output as follows:
Y=f(x)
The main objective is to approximate the mapping function so well that, when we have new input data (x), we can easily predict the output variable (Y) for that new input data.
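
A minimal sketch of learning Y=f(x) from labelled data follows. It assumes the mapping is roughly a straight line and fits it by least squares with NumPy's `polyfit`; the training values and the name `f_hat` are made up for illustration.

```python
import numpy as np

# Hypothetical labelled training data: inputs x with known outputs y = 2x + 1.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Fit a degree-1 polynomial; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def f_hat(x):
    """Learned approximation of the mapping Y = f(x)."""
    return slope * x + intercept

# Predict the output for an input that was not in the training data.
x_new = 10.0
print(round(f_hat(x_new), 3))
```

Because the toy labels follow y = 2x + 1 exactly, the prediction for `x_new` should land very close to 21.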

Examples of Supervised Learning:

Email Filtering
Supervised learning can be used to build models that classify emails as either "spam" or "not spam" based on the content and features of the emails.

Voice Recognition
Supervised learning is utilized in voice recognition to help virtual assistants and other applications recognize and understand spoken commands. A labeled dataset of spoken words and phrases with corresponding text transcripts is used to train a machine learning algorithm in such scenarios.

Unsupervised Learning:

As the name suggests, unsupervised learning is the opposite of supervised learning: there is no supervisor to provide any sort of guidance. Unsupervised learning algorithms are handy when we do not have pre-labeled training data, as we do in supervised learning, and want to extract useful patterns from the input data.

For example:
Suppose we have x: Input variables. There is no corresponding output variable, and the algorithm needs to discover interesting patterns in the data on its own.

Examples of Unsupervised Learning:

Medical imaging: distinguishing between different kinds of tissue.

Market research: differentiating groups of customers based on some attributes.
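
The market research case can be sketched with a plain k-means clustering loop. Everything here is an assumption for illustration: the customer values are invented, the function name `k_means` is hypothetical, and the first `k` points are used as starting centroids so that the run is deterministic.

```python
import numpy as np

# Hypothetical customer data: (age, monthly spend). No labels are provided.
customers = np.array([
    [22, 30.0], [25, 35.0], [23, 28.0],      # younger, lower spend
    [48, 210.0], [52, 190.0], [50, 205.0],   # older, higher spend
])

def k_means(points, k, n_iter=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Assumption: the first k points serve as the starting centroids.
    centroids = points[:k].astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

labels, centroids = k_means(customers, k=2)
print(labels)
```

The algorithm never sees a group name; it only receives the inputs and recovers the two groups from their structure. On real data the features would usually be scaled first so that one column does not dominate the distances.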




**Q4 What is Data Cleaning and why do we need it?**  

Data cleaning refers to all the tasks and activities used to detect and repair errors in a dataset that may negatively impact a predictive model. It is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Data may contain incorrect values for many reasons, such as being mistyped, corrupted, or duplicated, and when multiple data sources are combined there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no single way to prescribe the exact steps in the data cleaning process, because the process varies from dataset to dataset. It is still crucial to establish a template for your data cleaning process so that you know you are doing it the right way every time.

Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:

Removal of errors when multiple sources of data are at play.
Fewer errors make for happier clients and less-frustrated employees.

Duplicates usually show up when data is blended from different sources (e.g., spreadsheets, websites, and databases) or when a customer has multiple points of contact with a company or has submitted redundant forms. This repeated data uses up server space and processing resources, creating larger files and less efficient analysis. In this case, data cleaning helps by removing the duplicates and keeping only the unique values.

Manage Incomplete Data: Data might be missing values for a few reasons (e.g., customers not providing certain information), and addressing this is vital to analysis because it prevents bias and miscalculations. After isolating and examining the incomplete values, which can show up as “0,” “NA,” “none,” “null,” or “not applicable,” determine whether they are plausible values or the result of missing information. While the easiest solution might be to drop the incomplete data, be aware of any bias that might result from that action. Alternatives include replacing null values with substitutes based on statistical or conditional modeling, or flagging and commenting on the missing data.

Remove Irrelevant Observations: Data that’s not relevant to the problem being solved can slow down processing time. Removing these irrelevant observations doesn’t delete them from the source but excludes them from the current analysis. 

Fix Structural Errors: It's important to correct errors and inconsistencies, including typography, capitalization, abbreviation, and formatting.
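
As an alternative to dropping incomplete rows, the flag-and-impute approach described above might look like the following sketch. The frame, its column names, and the choice of the median as the substitute value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records; "NA" and "none" stand in for missing entries.
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "city": ["Paris", "NA", "none", "Lima"],
    "age": [30.0, np.nan, 40.0, 50.0],
})

# 1. Turn placeholder strings into real missing values (mask() inserts NaN).
df["city"] = df["city"].mask(df["city"].isin(["NA", "none", ""]))

# 2. Flag rows whose age was missing before filling them in.
df["age_was_missing"] = df["age"].isna()

# 3. Impute the numeric gap with the column median instead of dropping the row.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

Keeping the flag column means later analysis can still tell observed values apart from imputed ones.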


In [1]:
import pandas as pd

In [2]:
df_users = pd.DataFrame({
    "user_id": [234, 235, 236, 237, 237, 238, 239, 240, 241, 242, 242],
    "Name": ["Tom", "Alex--", "..Thomas", "John", "John", "Paul/", "Emma9", "Joy", "Samantha_", "Emily", "Emily"],
    "Last_name": ["Smith", "johnson", "brown", "Davis", "Davis", "None", "wilson", "Thompson", "Lee", "Johnson", "Johnson"],
    "age": [23, 32, 45, 22, 22, 50, 34, 47, 28, 19, 19],
    "Phone": ["555/123/4567", "333-234-5678", "444_456_7890", "111-222-3333", "111-222-3333", None, "333/987/4567", "222/345_987", "(777) 987-6543", "777-888-9999", "777-888-9999"],
    "Email": ["smith@email.com", "johnson@hotmail.com", "brown@email.com", "davis@mail.com", "davis@mail.com", "John@gmail.com", "wilson@mail.com", "thompson@email.com", "lee@email.com", "emily@hotmail.com", "emily@hotmail.com"],
    "Not_Useful_column": [None, None, None, None, None, None, None, None, None, None, None]
})

print(df_users)

    user_id       Name Last_name  age           Phone                Email  \
0       234        Tom     Smith   23    555/123/4567      smith@email.com   
1       235     Alex--   johnson   32    333-234-5678  johnson@hotmail.com   
2       236   ..Thomas     brown   45    444_456_7890      brown@email.com   
3       237       John     Davis   22    111-222-3333       davis@mail.com   
4       237       John     Davis   22    111-222-3333       davis@mail.com   
5       238      Paul/      None   50            None       John@gmail.com   
6       239      Emma9    wilson   34    333/987/4567      wilson@mail.com   
7       240        Joy  Thompson   47     222/345_987   thompson@email.com   
8       241  Samantha_       Lee   28  (777) 987-6543        lee@email.com   
9       242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   
10      242      Emily   Johnson   19    777-888-9999    emily@hotmail.com   

   Not_Useful_column  
0               None  
1               None  
2               None  
3               None  
4               None  
5               None  
6               None  
7               None  
8               None  
9               None  
10              None  

Here we use the pandas `DataFrame()` constructor to create a mock dataset with 7 columns and 11 rows. The columns are `user_id` (the user's unique id), `Name`, `Last_name`, `age`, `Phone` (the user's phone number), `Email`, and `Not_Useful_column`, a non-useful column that we will use as an example of how to delete an unnecessary column from a dataset.

As you can see, the example dataset has several inconsistencies: a few unnecessary symbols in the `Name` column, some values in the `Last_name` column that are not capitalized, and values in the `Phone` column that use different formats, which makes them difficult to work with. It also contains duplicate rows, missing values, and a column that is entirely empty.
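
Before cleaning, it can help to quantify these problems rather than spot them by eye. The sketch below runs on `sample`, an abbreviated four-row stand-in for `df_users` that is introduced here only to keep the example self-contained.

```python
import pandas as pd

# Abbreviated stand-in for df_users (four rows taken from the dataset above).
sample = pd.DataFrame({
    "user_id": [237, 237, 238, 240],
    "Name": ["John", "John", "Paul/", "Joy"],
    "Phone": ["111-222-3333", "111-222-3333", None, "222/345_987"],
    "Not_Useful_column": [None, None, None, None],
})

# Missing values per column: an all-null column stands out immediately.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Names containing anything other than letters.
has_symbol = sample["Name"].str.contains(r"[^a-zA-Z]", regex=True)
odd_names = sample.loc[has_symbol, "Name"].tolist()

print(missing_per_column)
print(duplicate_rows)
print(odd_names)
```

The same three checks can be run on the full `df_users` frame to decide which cleaning steps are needed.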


Your final output should look like this:

In [11]:
df_users

Unnamed: 0,user_id,Name,Last_name,age,Phone,Email
0,234,Tom,Smith,23,555-123-4567,smith@email.com
1,235,Alex,Johnson,32,333-234-5678,johnson@hotmail.com
2,236,Thomas,Brown,45,444-456-7890,brown@email.com
3,237,John,Davis,22,111-222-3333,davis@mail.com
4,239,Emma,Wilson,34,333-987-4567,wilson@mail.com
5,241,Samantha,Lee,28,777-987-6543,lee@email.com
6,242,Emily,Johnson,19,777-888-9999,emily@hotmail.com


In [4]:
# Drop columns in which every value is null
df_users.dropna(axis=1, how='all', inplace=True)

In [5]:
# Drop rows that contain any null value
df_users.dropna(axis=0, how='any', inplace=True)

In [6]:
# Drop duplicate rows, keeping the first occurrence of each

df_users.drop_duplicates(inplace=True)

In [7]:
# Keep only the letters in each name, removing digits and special characters.
# regex=True is required: without it, recent pandas versions treat the pattern as a literal string.
df_users['Name'] = df_users['Name'].str.replace(r'[^a-zA-Z]', '', regex=True).str.strip()


In [8]:
#capitalise the first letter of last name
df_users['Last_name']=df_users['Last_name'].str.capitalize()

In [9]:
# Standardize the phone format.
# First strip everything except digits.
df_users['Phone'] = df_users['Phone'].str.replace(r'[^\d]', '', regex=True)
# Reformat ten-digit numbers as 3-3-4 (e.g. 555-123-4567).
df_users['Phone'] = df_users['Phone'].str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', regex=True)
# Flag the values that now match the 3-3-4 format exactly.
phone_format_corrected = df_users['Phone'].str.contains(r'^\d{3}-\d{3}-\d{4}$', na=False)
# Keep only the rows with a correctly formatted phone number.
df_users = df_users[phone_format_corrected].copy()

In [10]:
#reset the index to get in correct order
df_users.reset_index(drop=True, inplace=True)
#print the user table
print(df_users)

   user_id      Name Last_name  age         Phone                Email
0      234       Tom     Smith   23  555-123-4567      smith@email.com
1      235      Alex   Johnson   32  333-234-5678  johnson@hotmail.com
2      236    Thomas     Brown   45  444-456-7890      brown@email.com
3      237      John     Davis   22  111-222-3333       davis@mail.com
4      239      Emma    Wilson   34  333-987-4567      wilson@mail.com
5      241  Samantha       Lee   28  777-987-6543        lee@email.com
6      242     Emily   Johnson   19  777-888-9999    emily@hotmail.com
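
The same steps could also be collected into one reusable function that returns a cleaned copy instead of modifying the frame in place. This is a sketch, not the assignment's required solution: the function name `clean_users` is hypothetical, and it keeps only phone numbers with exactly ten digits, which gives the same result as the cells above on this dataset.

```python
import pandas as pd

def clean_users(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the users table; the input is left untouched."""
    cleaned = (
        df.dropna(axis=1, how="all")   # drop columns that are entirely null
          .dropna(axis=0, how="any")   # drop rows with any remaining null
          .drop_duplicates()           # keep one copy of each repeated row
          .copy()
    )
    cleaned["Name"] = cleaned["Name"].str.replace(r"[^a-zA-Z]", "", regex=True)
    cleaned["Last_name"] = cleaned["Last_name"].str.capitalize()
    digits = cleaned["Phone"].str.replace(r"\D", "", regex=True)
    # Keep only numbers with exactly ten digits, then format them as 3-3-4.
    cleaned = cleaned[digits.str.len() == 10].copy()
    cleaned["Phone"] = digits.loc[cleaned.index].str.replace(
        r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True
    )
    return cleaned.reset_index(drop=True)

# Three rows taken from the dataset above, used here as a quick check.
raw = pd.DataFrame({
    "user_id": [235, 238, 240],
    "Name": ["Alex--", "Paul/", "Joy"],
    "Last_name": ["johnson", "None", "Thompson"],
    "Phone": ["333-234-5678", None, "222/345_987"],
    "Not_Useful_column": [None, None, None],
})
result = clean_users(raw)
print(result)
```

Returning a new frame avoids the side effects of `inplace=True` and makes it easy to run the function again on a fresh export of the customer list.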
