# **What do you mean by inconsistent data entry?**



1.   Varying Date formats
2.   Inconsistent Units
3.   Spelling and Typographical Errors
4.   Inconsistent Categorization
5.   Incomplete or Missing Values
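
As a rough illustration of the list above, the following sketch builds a small table that shows several of these problems at once and runs a few quick checks. The records, column names, and variable names are invented for this example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical records, invented to show each kind of inconsistency
raw = pd.DataFrame({
    "date": ["2023-01-15", "15/01/2023", "Jan 15, 2023", "2023-01-16"],
    "weight": ["150 lbs", "68 kg", "150lbs", None],
    "city": ["Delhi", "delhi", "Dehli", "Mumbai"],
})

# Missing values per column
missing_per_column = raw.isna().sum()

# Distinct spellings in a column that should hold only a few categories
city_variants = sorted(raw["city"].str.lower().unique())

# Unit suffixes found at the end of each weight entry
units_found = raw["weight"].str.extract(r"([a-zA-Z]+)\s*$")[0].value_counts()

print(missing_per_column)
print(city_variants)
print(units_found)
```

Checks like these only surface candidate problems; deciding which spelling or unit is the correct one still needs knowledge of the data.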








---
Look at the following sample code snippet.

In [None]:
import pandas as pd

# Load the data into a Pandas DataFrame
df = pd.read_csv('data.csv')

# Handle inconsistent date formats
# (format='mixed' requires pandas >= 2.0; infer_datetime_format is deprecated)
df['date'] = pd.to_datetime(df['date'], format='mixed')

# Standardize units
df['weight'] = df['weight'].str.replace('lbs', '', regex=False).str.strip().astype(float)  # Remove 'lbs', trim whitespace, convert to float

# Standardize categorical values
df['category'] = df['category'].str.lower()  # Convert to lowercase

# Correct spelling errors
df['name'] = df['name'].str.replace('mispelled', 'misspelled')  # Replace misspelled value

# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())  # Fill missing values with median

# Drop duplicates
df.drop_duplicates(inplace=True)


# Save the cleaned data to a new file
df.to_csv('cleaned_data.csv', index=False)


The code demonstrates the following techniques for handling inconsistent data:

1. Date Format: We use the pd.to_datetime function to convert the 'date' column 
   to a standardized datetime format.

2. Unit Standardization: We strip the 'lbs' suffix from the 'weight' column and 
   convert the values to float. This assumes every value is already in pounds; 
   values recorded in other units would need to be converted first.

3. Categorical Standardization: We convert the 'category' column to lowercase 
   for consistent categorical values.

4. Spelling Correction: We replace a misspelled value in the 'name' column.

5. Missing Value Handling: We fill missing values in the 'age' column with the 
   median value.

6. Duplicate Removal: We drop duplicate rows from the DataFrame.

7. Saving Cleaned Data: We save the cleaned data to a new CSV file named 
   'cleaned_data.csv'.
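
Stripping a suffix is only enough when every value uses the same unit. The sketch below shows one possible way to handle a column that mixes pounds and kilograms by converting everything to kilograms. The sample values, the `to_kg` table, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical weights recorded in mixed units
weights = pd.Series(["150 lbs", "68 kg", "70kg", "165lbs", None])

# Split each entry into a number and a unit suffix
parts = weights.str.extract(r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)\s*$")
values = parts["value"].astype(float)
units = parts["unit"].str.lower()

# Factors that convert each unit to kilograms (1 lb = 0.45359237 kg)
to_kg = {"kg": 1.0, "lbs": 0.45359237, "lb": 0.45359237}

# Unknown units and missing entries map to NaN rather than a guessed value
weight_kg = values * units.map(to_kg)

print(weight_kg.round(2))
```

Leaving unrecognized units as NaN keeps them visible for manual review instead of silently mixing scales.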

Let us try some of these techniques in more detail.

---
# I) Categorical Standardization


In [None]:
import pandas as pd

# Load the data into a DataFrame
data = pd.read_csv('data.csv')

# Define the standardized categories
standard_categories = {
    'category1': 'Category A',
    'category2': 'Category B',
    'category3': 'Category C',
    'category4': 'Category D'
}

# Standardize the 'category' column
data['category'] = data['category'].replace(standard_categories)

# Print the updated DataFrame
print(data)
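
A dictionary lookup only matches keys exactly, so labels with stray spaces or different capitalization slip through. The following sketch, which uses invented sample labels and a hypothetical `category_std` column, normalizes the text first and then lists whatever could not be mapped.

```python
import pandas as pd

# Hypothetical raw labels with stray whitespace and mixed case
data = pd.DataFrame({"category": [" Category1", "CATEGORY2 ", "category3", "cat-5", None]})

standard_categories = {
    "category1": "Category A",
    "category2": "Category B",
    "category3": "Category C",
    "category4": "Category D",
}

# Normalize whitespace and case so the dictionary keys match
normalized = data["category"].str.strip().str.lower()

# map() returns NaN for anything not in the dictionary,
# which makes unmapped labels easy to find
data["category_std"] = normalized.map(standard_categories)

unmapped = data.loc[data["category_std"].isna() & data["category"].notna(), "category"]
print(data)
print("Unmapped labels:", unmapped.tolist())
```

Unlike `replace`, which leaves unknown labels as they are, `map` turns them into NaN, so the choice depends on whether unknown labels should be kept or flagged.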


# II) Spelling Correction

In [None]:
import pandas as pd
import textdistance

# Load the data into a DataFrame
data = pd.read_csv('data.csv')

# Define a function for spelling correction
def correct_spelling(word, dictionary):
    # Leave missing or non-string values unchanged
    if not isinstance(word, str):
        return word
    # Pick the dictionary entry with the smallest normalized Levenshtein distance
    return min(dictionary, key=lambda x: textdistance.levenshtein.normalized_distance(word, x))

# Define the spelling dictionary
spelling_dictionary = ['apple', 'banana', 'orange', 'grape']

# Apply spelling correction to the 'fruit' column
data['fruit'] = data['fruit'].apply(lambda x: correct_spelling(x, spelling_dictionary))

# Print the updated DataFrame
print(data)
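
Picking the nearest dictionary entry always returns some match, even for a word that is nothing like any entry. The sketch below is a variant that swaps in `difflib.get_close_matches` from the Python standard library, which uses a similarity ratio rather than Levenshtein distance, and only accepts a correction above a cutoff. The sample values and the `cutoff=0.7` threshold are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

spelling_dictionary = ["apple", "banana", "orange", "grape"]

def correct_spelling(word, dictionary, cutoff=0.7):
    """Return the closest dictionary entry, or the input unchanged if none is close enough."""
    if not isinstance(word, str):
        return word  # leave missing values untouched
    matches = difflib.get_close_matches(word.strip().lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical column with typos, an unknown word, and a missing value
data = pd.DataFrame({"fruit": ["aple", "bananna", "Orange", "kiwi", None]})
data["fruit"] = data["fruit"].apply(lambda x: correct_spelling(x, spelling_dictionary))
print(data)
```

The cutoff is a trade-off: a lower value corrects more typos but raises the risk of rewriting a valid word that is simply missing from the dictionary.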


# III) Duplicate Removal

In [None]:
import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('data.csv')

# Remove duplicates based on all columns
df_no_duplicates = df.drop_duplicates()

# Remove duplicates based on specific columns
df_no_duplicates = df.drop_duplicates(subset=['column1', 'column2'])

# Modify the original DataFrame to remove duplicates
df.drop_duplicates(inplace=True)
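
`drop_duplicates` only removes rows that match exactly, so it helps to count duplicates first and to normalize text before comparing. The following sketch uses invented records and column names to show both steps.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 differ only in case and whitespace
df = pd.DataFrame({
    "name": ["Alice", "Bob", "alice ", "Bob"],
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
})

# Count exact duplicates before dropping anything
exact_duplicates = int(df.duplicated().sum())

# Normalize text first so near-identical rows compare equal
normalized = df.assign(name=df["name"].str.strip().str.lower())

# keep="first" retains the first occurrence of each duplicated row
deduplicated = normalized.drop_duplicates(subset=["name", "city"], keep="first")

print("Exact duplicates:", exact_duplicates)
print(deduplicated)
```

Here only one row is an exact duplicate, but two rows are removed once the names are normalized.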


# Challenge Problem: Find any data inconsistencies in the House Price dataset and correct them by applying the techniques described above.