
In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Creating a sample dataset with missing values and categorical features
data = {'age': [25, 30, np.nan, 40, 45],
        'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
        'income': [50000, 60000, 70000, np.nan, 90000]}

df = pd.DataFrame(data)

In [3]:
df.head()

    age  gender      city   income
0  25.0    Male  New York  50000.0
1  30.0  Female    London  60000.0
2   NaN    Male     Paris  70000.0
3  40.0  Female     Tokyo      NaN
4  45.0    Male    Sydney  90000.0


In [4]:
print(df.isnull().sum())

age       1
gender    0
city      0
income    1
dtype: int64


In [5]:
df.describe()

             age        income
count   4.000000      4.000000
mean   35.000000  67500.000000
std     9.128709  17078.251277
min    25.000000  50000.000000
25%    28.750000  57500.000000
50%    35.000000  65000.000000
75%    41.250000  75000.000000
max    45.000000  90000.000000


In [6]:
# Exercise 1: Handling missing values
# 1. Identify missing values in the dataset.
print("Missing values in the dataset:\n", df.isnull().sum())


Missing values in the dataset:
 age       1
gender    0
city      0
income    1
dtype: int64


In [7]:
df['age']

    age
0  25.0
1  30.0
2   NaN
3  40.0
4  45.0


In [8]:
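# 2. Replace missing values in 'age' column with the mean age.
# Assign the result back rather than calling fillna(..., inplace=True) on
# df['age'], which operates on an intermediate object that behaves as a copy.
df['age'] = df['age'].fillna(df['age'].mean())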

In [9]:
df.head()

    age  gender      city   income
0  25.0    Male  New York  50000.0
1  30.0  Female    London  60000.0
2  35.0    Male     Paris  70000.0
3  40.0  Female     Tokyo      NaN
4  45.0    Male    Sydney  90000.0


In [10]:
# 3. Replace missing values in 'income' column with the median income.
# Assigning the result back avoids the pandas warning raised by the chained
# form df['income'].fillna(..., inplace=True), which stops working in pandas 3.0
# because the intermediate object always behaves as a copy.
df['income'] = df['income'].fillna(df['income'].median())


In [11]:
# prompt: df['income'].fillna(df['income'].median(), inplace=True) - in this line median is used, when to use median and when to use mean for fillna()

import pandas as pd
import numpy as np

# ... (your existing code)

# Explanation of when to use mean vs. median for filling NaN values:

# Mean:
# - Use the mean when the data is normally distributed or approximately symmetric.
# - Sensitive to outliers. Extreme values can significantly skew the mean, leading to
#   imputed values that might not represent the typical value.

# Median:
# - Use the median when the data is skewed, contains outliers, or has a non-normal distribution.
# - Robust to outliers because it represents the middle value in the sorted dataset.
# - Less affected by extreme values, providing a more representative imputation for skewed data.


# Example illustrating the difference:

# Create a DataFrame with an outlier
data_with_outlier = {'income': [50000, 60000, 70000, 80000, 1000000]} # outlier: 1000000
df_outlier = pd.DataFrame(data_with_outlier)

# Calculate mean and median
mean_income = df_outlier['income'].mean()
median_income = df_outlier['income'].median()

print(f"Mean income with outlier: {mean_income}")
print(f"Median income with outlier: {median_income}")

# Apply fillna with the mean and with the median. This column has no missing
# values, so both new columns are identical copies of 'income'; the contrast
# that matters is between the two fill values printed above.
df_outlier['income_mean'] = df_outlier['income'].fillna(df_outlier['income'].mean())
df_outlier['income_median'] = df_outlier['income'].fillna(df_outlier['income'].median())

df_outlier
# Observation: The mean (252,000) is pulled far above four of the five incomes by the
# single outlier (1,000,000), while the median (70,000) stays at a typical value. Had a
# value been missing here, the median would be the more appropriate fill value.


# In your original code, the missing 'income' value is filled with the median.

# Income data is often right-skewed (some people earn significantly more than
# others), and the median is the more robust choice in such cases. Even a few
# very high incomes are enough to inflate the mean. It may be best to visualize
# the data with a box plot or histogram before choosing an imputation strategy.


Mean income with outlier: 252000.0
Median income with outlier: 70000.0


    income  income_mean  income_median
0    50000        50000          50000
1    60000        60000          60000
2    70000        70000          70000
3    80000        80000          80000
4  1000000      1000000        1000000
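
Because the example above contains no missing values, the following sketch adds one so that the two strategies actually produce different results. The series `incomes` and its values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical income column: one missing value and one extreme outlier.
incomes = pd.Series([50000, 60000, 70000, np.nan, 1000000], name="income")

# mean() and median() skip NaN by default, so both are computed
# from the four observed values.
mean_filled = incomes.fillna(incomes.mean())
median_filled = incomes.fillna(incomes.median())

print("Filled with mean:  ", mean_filled[3])
print("Filled with median:", median_filled[3])
```

Here the mean-based fill value lands well above three of the four observed incomes, while the median-based value sits between the middle two.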


In [None]:
# prompt: when to use mean and when to use median in my data pre-processing step

# When to use mean and median for imputation:

# Mean:
# - Use when the data is approximately normally distributed (roughly symmetric).
# - Sensitive to outliers, so extreme values can pull the imputed value away from a typical one.
# - Applies to numerical features only.

# Median:
# - Use when the data is skewed or has outliers.
# - Less sensitive to outliers than the mean, so it gives a more robust measure of central tendency.
# - Applies to numerical features only.

# Example in the provided code:
# - The mean was used for the 'age' column. This is suitable if 'age' is roughly
#   symmetric with few outliers; otherwise median imputation is preferred.
# - The median was used for the 'income' column. This is a good choice if 'income'
#   is skewed by high earners, because the median better represents a typical income.
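
One way to turn the rule of thumb above into code is to look at the sample skewness of a column. The sketch below is only a heuristic: the helper `choose_fill_value` and the cutoff `skew_threshold=1.0` are assumptions made for illustration, not an established standard, and with very few rows the skewness estimate is unreliable.

```python
import pandas as pd

def choose_fill_value(series, skew_threshold=1.0):
    """Return the median for strongly skewed data, otherwise the mean.

    skew_threshold is an illustrative cutoff, not a fixed rule.
    """
    skewness = series.skew()  # NaN values are skipped by default
    if abs(skewness) > skew_threshold:
        return series.median()
    return series.mean()

# Roughly symmetric values: the mean is used.
ages = pd.Series([25, 30, 40, 45])
print(choose_fill_value(ages))

# One extreme value makes the column strongly right-skewed: the median is used.
incomes = pd.Series([50000, 60000, 70000, 80000, 1000000])
print(choose_fill_value(incomes))
```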

In [12]:
df.head()

    age  gender      city   income
0  25.0    Male  New York  50000.0
1  30.0  Female    London  60000.0
2  35.0    Male     Paris  70000.0
3  40.0  Female     Tokyo  65000.0
4  45.0    Male    Sydney  90000.0
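
The two imputation steps can also be written as a single call by passing `fillna` a dictionary that maps column names to fill values. This is a minimal sketch on a fresh copy of the numeric columns; the names `raw`, `fill_values`, and `filled` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same numeric columns as the sample dataset, before any imputation.
raw = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 45],
    "income": [50000, 60000, 70000, np.nan, 90000],
})

# One fill value per column: mean for 'age', median for 'income'.
fill_values = {
    "age": raw["age"].mean(),
    "income": raw["income"].median(),
}

# fillna returns a new DataFrame; 'raw' itself is left unchanged.
filled = raw.fillna(fill_values)
print(filled)
```

Calling `fillna` on the DataFrame and keeping the returned object avoids the chained in-place call on a single column.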


In [15]:
# prompt: explain to me step by step why i need to write these lines - from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()

# You would use LabelEncoder from sklearn.preprocessing if you wanted to convert
# categorical features (like 'city' in your example) into numerical labels.

# Here's a breakdown of why you might need it and how it works:

# 1. Many machine learning algorithms work best with numerical data:
#    Many algorithms cannot directly handle categorical data like strings
#    (e.g., 'New York', 'London').

# 2. LabelEncoder assigns a unique integer to each unique category:
#    It maps each distinct value in a categorical column to a number.
#    Classes are sorted first, so here 'London' maps to 0, 'New York' to 1,
#    'Paris' to 2, 'Sydney' to 3, and 'Tokyo' to 4.

# Example using your data:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])

# Explanation of the code:
# - from sklearn.preprocessing import LabelEncoder: Imports the LabelEncoder class.
# - le = LabelEncoder(): Creates an instance of the LabelEncoder class.
# - df['city_encoded'] = le.fit_transform(df['city']):
#   - le.fit_transform() fits the encoder to your data and then transforms
#     the data (the 'city' column).
#   - The result is a new column, 'city_encoded', containing the numerical labels.

# Note: LabelEncoder numbers categories in sorted order, so the integers carry
# no meaningful ranking, and scikit-learn intends it for target labels. For
# unordered features such as 'city', one-hot encoding is usually more
# appropriate; for ordered categories (e.g., low, medium, high), use an
# encoder that lets you specify the order, such as OrdinalEncoder.


In [16]:
df['city_encoded']

   city_encoded
0             1
1             0
2             2
3             4
4             3
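
The integer codes above imply an ordering (Tokyo > Sydney > Paris) that has no real meaning for cities. A minimal sketch of the one-hot alternative using `pd.get_dummies` follows; the frame `cities` is a standalone copy of the 'city' column, created here for illustration.

```python
import pandas as pd

cities = pd.DataFrame({"city": ["New York", "London", "Paris", "Tokyo", "Sydney"]})

# One indicator column per city; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(cities, columns=["city"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, so no city is treated as larger or smaller than another. The trade-off is one extra column per category.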


In [18]:
# 2. Convert 'gender' column to numerical using label encoding
#    ('city' was already label-encoded in the cell above).
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender_encoded'] = le.fit_transform(df['gender'])

In [19]:
df

    age  gender      city   income  city_encoded  gender_encoded
0  25.0    Male  New York  50000.0             1               1
1  30.0  Female    London  60000.0             0               0
2  35.0    Male     Paris  70000.0             2               1
3  40.0  Female     Tokyo  65000.0             4               0
4  45.0    Male    Sydney  90000.0             3               1
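
To see which integer stands for which category, the fitted encoder's `classes_` attribute can be inspected, and `inverse_transform` recovers the original labels. A short sketch on a standalone copy of the 'gender' values (the names `genders`, `encoder`, and `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

genders = ["Male", "Female", "Male", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the categories in sorted order; the class at position i is encoded as i.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

Keeping a separate encoder object per column preserves each mapping for later decoding; reusing one variable for both 'city' and 'gender', as in the cells above, overwrites the first mapping when the second column is fitted.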


In [20]:
df[['age', 'income']]

    age   income
0  25.0  50000.0
1  30.0  60000.0
2  35.0  70000.0
3  40.0  65000.0
4  45.0  90000.0


In [21]:
# Exercise 3: Feature scaling
# 1. Apply min-max scaling to 'age' and 'income' columns.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

In [22]:
df

    age  gender      city  income  city_encoded  gender_encoded
0  0.00    Male  New York   0.000             1               1
1  0.25  Female    London   0.250             0               0
2  0.50    Male     Paris   0.500             2               1
3  0.75  Female     Tokyo   0.375             4               0
4  1.00    Male    Sydney   1.000             3               1
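
Min-max scaling maps each column to the range [0, 1] with the formula (x - min) / (max - min). The sketch below reproduces the scaled values by hand with pandas, using the imputed 'age' and 'income' values from the notebook; the variable names are illustrative.

```python
import pandas as pd

values = pd.DataFrame({
    "age": [25.0, 30.0, 35.0, 40.0, 45.0],
    "income": [50000.0, 60000.0, 70000.0, 65000.0, 90000.0],
})

# Column-wise (x - min) / (max - min); pandas broadcasts the per-column
# minimum and range across all rows.
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)
```

For example, the income 65000 becomes (65000 - 50000) / (90000 - 50000) = 0.375, matching the output above.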


In [23]:
# Exercise 4: Data splitting
# 1. Split the dataset into training and testing sets. With test_size=0.25 and
#    5 rows this yields 3 training rows and 2 test rows (an 80/20 split would
#    need test_size=0.2).
from sklearn.model_selection import train_test_split
X = df.drop('income', axis=1)  # Features (still includes the raw 'gender' and 'city' text columns)
y = df['income']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [24]:
print(X)

    age  gender      city  city_encoded  gender_encoded
0  0.00    Male  New York             1               1
1  0.25  Female    London             0               0
2  0.50    Male     Paris             2               1
3  0.75  Female     Tokyo             4               0
4  1.00    Male    Sydney             3               1


In [25]:
print(y)

0    0.000
1    0.250
2    0.500
3    0.375
4    1.000
Name: income, dtype: float64


In [26]:
print(y_train)

2    0.500
0    0.000
3    0.375
Name: income, dtype: float64


In [27]:
print(X_train)

    age  gender      city  city_encoded  gender_encoded
2  0.50    Male     Paris             2               1
0  0.00    Male  New York             1               1
3  0.75  Female     Tokyo             4               0


In [28]:
print(X_test)

    age  gender    city  city_encoded  gender_encoded
1  0.25  Female  London             0               0
4  1.00    Male  Sydney             3               1
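
In the cells above the scaler was fitted on all rows before splitting, so the test rows influenced the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a sketch under that assumption, using a numeric-only copy of the features; the names `features` and `target` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Numeric-only features (the raw text columns are left out) and the target.
features = pd.DataFrame({
    "age": [25.0, 30.0, 35.0, 40.0, 45.0],
    "gender_encoded": [1, 0, 1, 0, 1],
})
target = pd.Series([50000.0, 60000.0, 70000.0, 65000.0, 90000.0], name="income")

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# Fit on the training rows only, then reuse the same min/max for the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[["age"]])
X_test_scaled = scaler.transform(X_test[["age"]])

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

With this approach, scaled test values can fall outside [0, 1] when a test row lies beyond the range seen in training. That is expected and reflects what the model would encounter on new data.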


In [29]:
print("\nPreprocessed Data:\n", df)


Preprocessed Data:
     age  gender      city  income  city_encoded  gender_encoded
0  0.00    Male  New York   0.000             1               1
1  0.25  Female    London   0.250             0               0
2  0.50    Male     Paris   0.500             2               1
3  0.75  Female     Tokyo   0.375             4               0
4  1.00    Male    Sydney   1.000             3               1


In [None]:
# prompt: By replacing the 1st digit of the 2-digit number *3, it turns out that six of the nine possible values: 13, 23, 43, 53, 73, and 83, are all prime.
# By replacing the 3rd and 4th digits of 56**3 with the same digit, this 5-digit number is the first example having seven primes among the ten generated numbers, yielding the family: 56003, 56113, 56333, 56443, 56663, 56773, and 56993. Consequently 56003, being the first member of this family, is the smallest prime with this property.
# Find the smallest prime which, by replacing part of the number (not necessarily adjacent digits) with the same digit, is part of an eight prime value family.

def is_prime(n):
    """Checks if a number is prime."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def find_prime_family(prime_candidate):
    """Finds prime families by replacing digits."""
    s_candidate = str(prime_candidate)
    n = len(s_candidate)
    families = []

    # Iterate through all possible combinations of indices to replace
    for i in range(1, 2**n): # We need to replace at least one digit
        indices_to_replace = []
        for j in range(n):
            if (i >> j) & 1:
                indices_to_replace.append(j)

        # Only consider combinations where the replaced digits are the same
        first_replaced_digit = None
        is_valid_combination = True
        if indices_to_replace:
            first_replaced_digit = s_candidate[indices_to_replace[0]]
            for idx in indices_to_replace:
                if s_candidate[idx] != first_replaced_digit:
                    is_valid_combination = False
                    break
        else:
            is_valid_combination = False

        if is_valid_combination:
            family = []
            for digit in range(10):
                new_number_str_list = list(s_candidate)
                for idx in indices_to_replace:
                    new_number_str_list[idx] = str(digit)
                new_number_str = "".join(new_number_str_list)

                # Ensure the generated number does not start with 0 if it's a multi-digit number
                if len(new_number_str) > 1 and new_number_str.startswith('0'):
                    continue

                new_number = int(new_number_str)

                if is_prime(new_number):
                    family.append(new_number)

            if len(family) > 1: # Only consider families with more than one member
                families.append(sorted(family))
    return families

def solve():
    """Finds the smallest prime that is part of an eight prime value family."""
    prime_limit = 1000000  # Upper bound for the search

    # Test candidates in ascending order and stop at the first hit, so primes
    # beyond the answer are never generated.
    for candidate in range(2, prime_limit):
        if not is_prime(candidate):
            continue
        for family in find_prime_family(candidate):
            if len(family) == 8:
                # Candidates are visited in ascending order, so the first prime
                # that belongs to an eight-member family is also that family's
                # smallest member (families are sorted).
                return family[0]

    return None  # No eight-prime family exists below prime_limit

smallest_prime_family_of_eight = solve()
print(f"The smallest prime which is part of an eight prime value family is: {smallest_prime_family_of_eight}")
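
Before trusting the search, it can help to check the replacement logic against the two families quoted in the prompt. The sketch below uses a hypothetical helper, `prime_family`, that takes a pattern with `*` marking the replaced positions; it is a sanity check written for illustration, separate from the solver above.

```python
def is_prime(n):
    """Trial-division primality test."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def prime_family(pattern):
    """Return the primes obtained by filling every '*' with the same digit."""
    members = []
    for digit in "0123456789":
        candidate = pattern.replace("*", digit)
        # Skip numbers with a leading zero, which would have fewer digits.
        if candidate[0] == "0":
            continue
        if is_prime(int(candidate)):
            members.append(int(candidate))
    return members

print(prime_family("*3"))     # six primes, per the problem statement
print(prime_family("56**3"))  # seven primes, per the problem statement
```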

