
**TASK 1**

# Task
Generate a Python script to clean an employee dataset by handling missing values, converting 'joining_date' to datetime, standardizing department names, and encoding categorical variables ('department', 'job_role'). The output should be a cleaned Pandas DataFrame.

In [10]:
import pandas as pd
import numpy as np

# 1. Generate a simple synthetic dataset
data = {
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'salary': [50000, 60000, np.nan, 75000, 55000, 62000, np.nan, 80000, 58000, 70000],
    'department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance', 'Human Resources', 'sales', 'IT', 'hr', 'Marketing'],
    'job_role': ['Associate', 'Manager', 'Analyst', 'Manager', 'Associate', 'Analyst', 'Associate', 'Manager', 'Analyst', 'Manager'],
    'joining_date': ['2020-01-15', '2019-05-20', '2021-03-10', '2018-11-01', '2022-07-25', '2020-09-18', np.nan, '2019-12-01', '2021-06-10', 'invalid-date']
}

df = pd.DataFrame(data)

print("Synthetic dataset created:")
display(df)


# 2. Handle missing values
print("\nMissing values before handling:")
display(df[['salary', 'department', 'joining_date']].isnull().sum())

# Fill missing values in 'salary' with the median
# (assign the result back; chained fillna(..., inplace=True) acts on a copy)
median_salary = df['salary'].median()
df['salary'] = df['salary'].fillna(median_salary)

# Fill missing values in 'department' with the mode
mode_department = df['department'].mode()[0]
df['department'] = df['department'].fillna(mode_department)

# Fill missing values in 'joining_date' with the placeholder string 'NaT';
# pd.to_datetime(..., errors='coerce') in step 3 turns it back into a real NaT
df['joining_date'] = df['joining_date'].fillna('NaT')


print("\nMissing values after handling:")
display(df[['salary', 'department', 'joining_date']].isnull().sum())


# 3. Convert to datetime
# Use errors='coerce' to turn unparseable dates into NaT
df['joining_date'] = pd.to_datetime(df['joining_date'], errors='coerce')
print("\n'joining_date' column converted to datetime.")
# info() prints its summary directly and returns None, so it is not wrapped in display()
df.info()


# 4. Standardize department names
# Convert to lowercase and then map common variations to a standard name
df['department'] = df['department'].str.lower()
department_mapping = {
    'hr': 'Human Resources',
    'human resources': 'Human Resources',
    'it': 'Information Technology',
    'information technology': 'Information Technology',
    'sales': 'Sales',
    'marketing': 'Marketing',
    'finance': 'Finance',
    # Add other mappings as needed
}
df['department'] = df['department'].map(department_mapping).fillna(df['department']) # Keep original if no mapping


print("\nDepartment names standardized.")
display(df['department'].value_counts())

# 5. Encode categorical variables ('department', 'job_role')
# Using one-hot encoding for simplicity
df = pd.get_dummies(df, columns=['department', 'job_role'], dummy_na=False)

print("\nCategorical variables encoded.")
display(df.head())


# 6. Display the cleaned data
print("\nCleaned DataFrame:")
display(df.head())


# 7. Finish task
print("\nData cleaning steps completed. The DataFrame 'df' contains the cleaned data.")

Synthetic dataset created:


,employee_id,salary,department,job_role,joining_date
0,1,50000.0,HR,Associate,2020-01-15
1,2,60000.0,Sales,Manager,2019-05-20
2,3,,IT,Analyst,2021-03-10
3,4,75000.0,Marketing,Manager,2018-11-01
4,5,55000.0,Finance,Associate,2022-07-25
5,6,62000.0,Human Resources,Analyst,2020-09-18
6,7,,sales,Associate,
7,8,80000.0,IT,Manager,2019-12-01
8,9,58000.0,hr,Analyst,2021-06-10
9,10,70000.0,Marketing,Manager,invalid-date



Missing values before handling:


,0
salary,2
department,0
joining_date,1



Missing values after handling:



,0
salary,0
department,0
joining_date,0



'joining_date' column converted to datetime.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   employee_id   10 non-null     int64         
 1   salary        10 non-null     float64       
 2   department    10 non-null     object        
 3   job_role      10 non-null     object        
 4   joining_date  8 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 532.0+ bytes




Department names standardized.


department,count
Human Resources,3
Sales,2
Information Technology,2
Marketing,2
Finance,1



Categorical variables encoded.


,employee_id,salary,joining_date,department_Finance,department_Human Resources,department_Information Technology,department_Marketing,department_Sales,job_role_Analyst,job_role_Associate,job_role_Manager
0,1,50000.0,2020-01-15,False,True,False,False,False,False,True,False
1,2,60000.0,2019-05-20,False,False,False,False,True,False,False,True
2,3,61000.0,2021-03-10,False,False,True,False,False,True,False,False
3,4,75000.0,2018-11-01,False,False,False,True,False,False,False,True
4,5,55000.0,2022-07-25,True,False,False,False,False,False,True,False



Cleaned DataFrame:


,employee_id,salary,joining_date,department_Finance,department_Human Resources,department_Information Technology,department_Marketing,department_Sales,job_role_Analyst,job_role_Associate,job_role_Manager
0,1,50000.0,2020-01-15,False,True,False,False,False,False,True,False
1,2,60000.0,2019-05-20,False,False,False,False,True,False,False,True
2,3,61000.0,2021-03-10,False,False,True,False,False,True,False,False
3,4,75000.0,2018-11-01,False,False,False,True,False,False,False,True
4,5,55000.0,2022-07-25,True,False,False,False,False,False,True,False



Data cleaning steps completed. The DataFrame 'df' contains the cleaned data.
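
One detail in the run above is easy to miss: the second missing-value check reports 0 for `joining_date` only because the gap was filled with the string 'NaT', yet `df.info()` still shows 8 non-null dates once the column is parsed. The sketch below uses a small hypothetical three-value series, not the notebook's DataFrame, and suggests that `pd.to_datetime(..., errors='coerce')` already maps both a missing entry and an unparseable string to `NaT`. Counting missing dates after parsing may therefore give a more honest picture. The explicit `format` argument is an assumption about the date layout.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one valid date, one missing entry, one unparseable string
raw_dates = pd.Series(['2020-01-15', np.nan, 'invalid-date'])

# errors='coerce' converts anything that cannot be parsed into NaT
parsed_dates = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Count missing dates after parsing, when NaN and bad strings look the same
n_missing_dates = int(parsed_dates.isna().sum())
print(parsed_dates)
print("Missing dates after parsing:", n_missing_dates)
```

Whether to drop, impute, or keep those `NaT` rows is a separate decision that depends on how the dates are used downstream.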


**TASK 2**

# Task
Use AI to generate a script for preprocessing a sales transaction dataset.

Instructions:
- Convert transaction dates to proper datetime format.
- Create a new column for “Month-Year” from the transaction date.
- Remove rows with negative or zero transaction amounts.
- Normalize the "transaction_amount" column using Min-Max scaling.

Expected Output:
- A preprocessed DataFrame with valid dates, normalized amounts, and no invalid records.
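
The “Month-Year” column requested above can be represented in more than one way. As a minimal sketch on two hypothetical dates, `dt.to_period('M')` yields a sortable monthly `Period`, while `dt.strftime` yields a plain string label. The `'%m-%Y'` label format is an assumption, since the task does not specify one.

```python
import pandas as pd

# Two hypothetical transaction dates, parsed with an explicit format
dates = pd.to_datetime(pd.Series(['2023-01-15', '2023-02-10']), format='%Y-%m-%d')

# Option 1: monthly Period values (sort chronologically, support date arithmetic)
as_period = dates.dt.to_period('M')

# Option 2: plain string labels in month-year order (assumed '%m-%Y' layout)
as_label = dates.dt.strftime('%m-%Y')

print(as_period.astype(str).tolist())
print(as_label.tolist())
```

Period values tend to be the safer choice for grouping and sorting, because string labels such as `01-2023` sort by month before year.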

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 1. Generate a simple synthetic sales transaction dataset
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'transaction_date': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-01', '2023-03-25', '2023-03-18', 'invalid-date', '2023-04-01', '2023-04-10', '2023-05-05'],
    'transaction_amount': [150.50, 200.00, -50.00, 120.75, 300.00, 180.00, 0, 250.00, 90.25, 500.00]
}

df_sales = pd.DataFrame(data)

print("Synthetic sales transaction dataset created:")
display(df_sales)

# 2. Convert transaction dates to proper datetime format.
# Use errors='coerce' to turn unparseable dates into NaT
df_sales['transaction_date'] = pd.to_datetime(df_sales['transaction_date'], errors='coerce')
print("\n'transaction_date' column converted to datetime.")
# info() prints its summary directly and returns None, so it is not wrapped in display()
df_sales.info()

# 3. Create a new column for “Month-Year” from the transaction date.
# Drop rows where transaction_date is NaT before creating Month-Year
df_sales.dropna(subset=['transaction_date'], inplace=True)
df_sales['month_year'] = df_sales['transaction_date'].dt.to_period('M')
print("\n'month_year' column created.")
display(df_sales.head())


# 4. Remove rows with negative or zero transaction amounts.
initial_rows = len(df_sales)
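# .copy() makes the filtered frame independent, so adding a column in step 5
# does not trigger a SettingWithCopyWarning
df_sales = df_sales[df_sales['transaction_amount'] > 0].copy()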
print(f"\nRemoved {initial_rows - len(df_sales)} rows with non-positive transaction amounts.")
display(df_sales.head())

# 5. Normalize the "transaction_amount" column using Min-Max scaling.
scaler = MinMaxScaler()
df_sales['transaction_amount_normalized'] = scaler.fit_transform(df_sales[['transaction_amount']])
print("\n'transaction_amount' column normalized using Min-Max scaling.")
display(df_sales.head())

# 6. Display the preprocessed data
print("\nPreprocessed DataFrame:")
display(df_sales.head())

# 7. Finish task
print("\nSales transaction data preprocessing completed. The DataFrame 'df_sales' contains the preprocessed data.")

Synthetic sales transaction dataset created:


,transaction_id,transaction_date,transaction_amount
0,101,2023-01-15,150.5
1,102,2023-01-20,200.0
2,103,2023-02-10,-50.0
3,104,2023-02-01,120.75
4,105,2023-03-25,300.0
5,106,2023-03-18,180.0
6,107,invalid-date,0.0
7,108,2023-04-01,250.0
8,109,2023-04-10,90.25
9,110,2023-05-05,500.0



'transaction_date' column converted to datetime.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   transaction_id      10 non-null     int64         
 1   transaction_date    9 non-null      datetime64[ns]
 2   transaction_amount  10 non-null     float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 372.0 bytes




'month_year' column created.


,transaction_id,transaction_date,transaction_amount,month_year
0,101,2023-01-15,150.5,2023-01
1,102,2023-01-20,200.0,2023-01
2,103,2023-02-10,-50.0,2023-02
3,104,2023-02-01,120.75,2023-02
4,105,2023-03-25,300.0,2023-03



Removed 1 rows with non-positive transaction amounts.


,transaction_id,transaction_date,transaction_amount,month_year
0,101,2023-01-15,150.5,2023-01
1,102,2023-01-20,200.0,2023-01
3,104,2023-02-01,120.75,2023-02
4,105,2023-03-25,300.0,2023-03
5,106,2023-03-18,180.0,2023-03



'transaction_amount' column normalized using Min-Max scaling.


,transaction_id,transaction_date,transaction_amount,month_year,transaction_amount_normalized
0,101,2023-01-15,150.5,2023-01,0.147041
1,102,2023-01-20,200.0,2023-01,0.267846
3,104,2023-02-01,120.75,2023-02,0.074436
4,105,2023-03-25,300.0,2023-03,0.511897
5,106,2023-03-18,180.0,2023-03,0.219036



Preprocessed DataFrame:


,transaction_id,transaction_date,transaction_amount,month_year,transaction_amount_normalized
0,101,2023-01-15,150.5,2023-01,0.147041
1,102,2023-01-20,200.0,2023-01,0.267846
3,104,2023-02-01,120.75,2023-02,0.074436
4,105,2023-03-25,300.0,2023-03,0.511897
5,106,2023-03-18,180.0,2023-03,0.219036



Sales transaction data preprocessing completed. The DataFrame 'df_sales' contains the preprocessed data.
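
The normalized values in the output above can be checked by hand. Min-Max scaling computes x' = (x - min) / (max - min), so the smallest remaining amount maps to 0 and the largest to 1. The sketch below applies that formula with plain pandas to the eight amounts that survive the date and amount filters. The variable names are illustrative, and this is a hand-rolled check, not a replacement for `MinMaxScaler`.

```python
import pandas as pd

# Positive amounts that remain after the invalid-date and non-positive rows are dropped
amounts = pd.Series([150.50, 200.00, 120.75, 300.00, 180.00, 250.00, 90.25, 500.00])

# Min-Max scaling: x' = (x - min) / (max - min), which maps the column onto [0, 1]
amount_min, amount_max = amounts.min(), amounts.max()
scaled = (amounts - amount_min) / (amount_max - amount_min)

print(scaled.round(6).tolist())
```

Because the minimum and maximum come from the data itself, a single extreme amount (here 500.00) compresses every other value toward 0.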


**TASK 3**

In [17]:
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

print("NLTK data downloaded.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


NLTK data downloaded.


In [13]:
# Install necessary libraries
%pip install nltk pandas beautifulsoup4



In [18]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

# The necessary NLTK data downloads are handled in a separate cell.
# Ensure the NLTK download cell (with punkt, stopwords, wordnet, punkt_tab)
# has been run successfully before running this cell.


# 1. Generate a simple synthetic social media text dataset
data = {
    'post_id': [1, 2, 3, 4, 5],
    'text': [
        "Wow, this is amazing! 😊 #happy #awesome",
        "Check out this link: https://example.com/great-article",
        "Feeling sad today... 😞",
        "This is the best product ever!!! 👍👍👍",
        "Having a great time with friends! 🎉🥳🥳"
    ]
}

df_social = pd.DataFrame(data)

print("Synthetic social media text dataset created:")
display(df_social)

# Define preprocessing functions
def remove_special_characters(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove emojis by dropping every non-ASCII character
    # (no regex involved; this also strips accented letters such as 'é')
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

def tokenize_text(text):
    # Ensure punkt tokenizer is available before tokenizing
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        print("Warning: NLTK 'punkt' tokenizer not found. Please run the NLTK download cell.")
        return [] # Return empty list if tokenizer is missing

    tokens = nltk.word_tokenize(text)
    return tokens

def remove_stopwords(tokens):
    # Ensure stopwords are available before removing
    try:
        stopwords_set = set(stopwords.words('english'))
    except LookupError:
        print("Warning: NLTK 'stopwords' not found. Please run the NLTK download cell.")
        return tokens # Return original tokens if stopwords are missing

    filtered_tokens = [word for word in tokens if word.lower() not in stopwords_set]
    return filtered_tokens

def lemmatize_tokens(tokens):
    # WordNet is loaded lazily, so a missing corpus raises LookupError on the
    # first lemmatize() call, not when the lemmatizer is constructed
    lemmatizer = WordNetLemmatizer()
    try:
        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    except LookupError:
        print("Warning: NLTK 'wordnet' not found. Please run the NLTK download cell.")
        return tokens # Return original tokens if wordnet is missing

    return lemmatized_tokens

# Apply preprocessing steps
df_social['cleaned_text'] = df_social['text'].apply(remove_special_characters)
df_social['lowercase_text'] = df_social['cleaned_text'].str.lower()
df_social['tokenized_text'] = df_social['lowercase_text'].apply(tokenize_text)
df_social['stopwords_removed'] = df_social['tokenized_text'].apply(remove_stopwords)
df_social['lemmatized_text'] = df_social['stopwords_removed'].apply(lemmatize_tokens)

# Join tokens back into a string for easier viewing (optional)
df_social['processed_text'] = df_social['lemmatized_text'].apply(lambda tokens: ' '.join(tokens))


# 6. Display the processed data
print("\nProcessed social media text DataFrame:")
display(df_social[['text', 'processed_text']].head())

# 7. Finish task
print("\nSocial media text preprocessing completed. The DataFrame 'df_social' contains the processed text, ready for NLP sentiment analysis.")

Synthetic social media text dataset created:


,post_id,text
0,1,"Wow, this is amazing! 😊 #happy #awesome"
1,2,Check out this link: https://example.com/great...
2,3,Feeling sad today... 😞
3,4,This is the best product ever!!! 👍👍👍
4,5,Having a great time with friends! 🎉🥳🥳



Processed social media text DataFrame:


,text,processed_text
0,"Wow, this is amazing! 😊 #happy #awesome",wow amazing happy awesome
1,Check out this link: https://example.com/great...,check link
2,Feeling sad today... 😞,feeling sad today
3,This is the best product ever!!! 👍👍👍,best product ever
4,Having a great time with friends! 🎉🥳🥳,great time friend



Social media text preprocessing completed. The DataFrame 'df_social' contains the processed text, ready for NLP sentiment analysis.
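
The cell above removes emojis by encoding to ASCII and discarding whatever fails to encode, which also drops accented letters and any non-Latin text. As a hedged alternative, the sketch below strips only characters in a few Unicode ranges commonly used for emojis. `EMOJI_PATTERN` and `strip_emojis` are hypothetical names, and the ranges are an assumption that is not exhaustive.

```python
import re

# Assumed, non-exhaustive emoji ranges: the main pictograph blocks, miscellaneous
# symbols and dingbats, and the variation selector that often follows an emoji
EMOJI_PATTERN = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]+')

def strip_emojis(text):
    """Remove characters in the assumed emoji ranges, keeping other non-ASCII text."""
    return EMOJI_PATTERN.sub('', text)

sample = "Café was amazing! 😊👍"
ascii_only = sample.encode('ascii', 'ignore').decode('ascii')  # approach used in the notebook
regex_only = strip_emojis(sample)

print(repr(ascii_only))
print(repr(regex_only))
```

For English-only posts the two approaches behave much the same; the difference matters once the text contains names or words with diacritics.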


**TASK 4**

In [12]:
import pandas as pd
import numpy as np

# 1. Generate a simple synthetic healthcare patient records dataset
data = {
    'patient_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'blood_pressure': [120, 135, np.nan, 140, 118, 130, np.nan, 125, 138, 122],
    'heart_rate': [75, 82, 78, np.nan, 80, 79, 85, np.nan, 76, 81],
    'height_cm': [175, 162, 180, 155, 170, 168, 172, 160, 178, 165],
    'weight_kg': [70, 65, 80, 58, 72, 68, 75, 62, 78, 60],
    'gender': ['Male', 'Female', 'male', 'Female', 'M', 'Female', 'Male', 'Female', 'male', 'F'],
    'condition': ['Flu', 'Cold', 'Fever', 'Flu', 'Cold', 'Fever', 'Flu', 'Cold', 'Fever', 'Flu']
}

df_healthcare = pd.DataFrame(data)

print("Synthetic healthcare patient records dataset created:")
display(df_healthcare)

# 2. Fill missing values in numeric columns with column mean.
numeric_cols = ['blood_pressure', 'heart_rate']
for col in numeric_cols:
    if col in df_healthcare.columns:
        mean_val = df_healthcare[col].mean()
        # Assign the result back; chained fillna(..., inplace=True) acts on a copy
        df_healthcare[col] = df_healthcare[col].fillna(mean_val)
        print(f"\nFilled missing values in '{col}' with mean ({mean_val:.2f}).")

print("\nMissing values after handling:")
display(df_healthcare[numeric_cols].isnull().sum())

# 3. Standardize units (convert height from cm to meters).
if 'height_cm' in df_healthcare.columns:
    df_healthcare['height_m'] = df_healthcare['height_cm'] / 100
    df_healthcare.drop('height_cm', axis=1, inplace=True)
    print("\n'height_cm' converted to 'height_m'.")
    display(df_healthcare[['height_m']].head())


# 4. Correct inconsistent categorical labels (e.g., "M", "Male", "male" → "Male").
if 'gender' in df_healthcare.columns:
    df_healthcare['gender'] = df_healthcare['gender'].replace({'M': 'Male', 'male': 'Male', 'F': 'Female'})
    print("\nInconsistent 'gender' labels corrected.")
    display(df_healthcare['gender'].value_counts())

# 5. Drop irrelevant columns such as patient_id after cleaning.
if 'patient_id' in df_healthcare.columns:
    df_healthcare.drop('patient_id', axis=1, inplace=True)
    print("\n'patient_id' column dropped.")

print("\nColumns after dropping 'patient_id':")
display(df_healthcare.columns)


# 6. Display the cleaned data
print("\nCleaned healthcare DataFrame:")
display(df_healthcare.head())

# 7. Finish task
print("\nHealthcare patient records cleaning steps completed. The DataFrame 'df_healthcare' contains the cleaned data, suitable for ML model training.")

Synthetic healthcare patient records dataset created:


,patient_id,blood_pressure,heart_rate,height_cm,weight_kg,gender,condition
0,1,120.0,75.0,175,70,Male,Flu
1,2,135.0,82.0,162,65,Female,Cold
2,3,,78.0,180,80,male,Fever
3,4,140.0,,155,58,Female,Flu
4,5,118.0,80.0,170,72,M,Cold
5,6,130.0,79.0,168,68,Female,Fever
6,7,,85.0,172,75,Male,Flu
7,8,125.0,,160,62,Female,Cold
8,9,138.0,76.0,178,78,male,Fever
9,10,122.0,81.0,165,60,F,Flu



Filled missing values in 'blood_pressure' with mean (128.50).

Filled missing values in 'heart_rate' with mean (79.50).

Missing values after handling:




,0
blood_pressure,0
heart_rate,0



'height_cm' converted to 'height_m'.


,height_m
0,1.75
1,1.62
2,1.8
3,1.55
4,1.7



Inconsistent 'gender' labels corrected.


gender,count
Male,5
Female,5



'patient_id' column dropped.

Columns after dropping 'patient_id':


Index(['blood_pressure', 'heart_rate', 'weight_kg', 'gender', 'condition',
       'height_m'],
      dtype='object')


Cleaned healthcare DataFrame:


,blood_pressure,heart_rate,weight_kg,gender,condition,height_m
0,120.0,75.0,70,Male,Flu,1.75
1,135.0,82.0,65,Female,Cold,1.62
2,128.5,78.0,80,Male,Fever,1.8
3,140.0,79.5,58,Female,Flu,1.55
4,118.0,80.0,72,Male,Cold,1.7



Healthcare patient records cleaning steps completed. The DataFrame 'df_healthcare' contains the cleaned data, suitable for ML model training.
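
Mean imputation for several numeric columns does not need a per-column loop. As a minimal sketch, passing the Series of column means to `DataFrame.fillna` fills each column with its own mean in one call and avoids chained in-place assignment entirely. The `vitals` frame below copies only the two numeric columns from the synthetic data; its name is illustrative.

```python
import numpy as np
import pandas as pd

# Same two numeric columns as the synthetic patient records above
vitals = pd.DataFrame({
    'blood_pressure': [120, 135, np.nan, 140, 118, 130, np.nan, 125, 138, 122],
    'heart_rate': [75, 82, 78, np.nan, 80, 79, 85, np.nan, 76, 81],
})

# DataFrame.mean() returns one mean per column; passing that Series to fillna()
# fills each column with its own mean and returns a new DataFrame
column_means = vitals.mean()
vitals_filled = vitals.fillna(column_means)

print(column_means)
print(vitals_filled.isnull().sum())
```

The means match the values printed by the loop (128.50 and 79.50), so the two approaches should be interchangeable for this dataset.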


**TASK 5**

In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1. Generate a simple synthetic financial dataset
data = {
    'date': pd.date_range(start='2023-01-01', periods=50, freq='D'),  # date_range already returns datetimes
    'company_name': np.random.choice(['CompanyA', 'CompanyB', 'CompanyC'], size=50),
    'sector': np.random.choice(['Tech', 'Finance', 'Healthcare'], size=50),
    'stock_price': np.random.rand(50) * 100 + 50, # Price between 50 and 150
    'volume': np.random.rand(50) * 100000 + 10000 # Volume between 10000 and 110000
}

df_financial = pd.DataFrame(data)

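# Introduce some missing values
# Note: the seed is set only here, so the positions of the missing values are
# reproducible, but the random values generated above change on every run
np.random.seed(42)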
missing_stock_price_indices = np.random.choice(df_financial.index, size=5, replace=False)
missing_volume_indices = np.random.choice(df_financial.index, size=5, replace=False)

df_financial.loc[missing_stock_price_indices, 'stock_price'] = np.nan
df_financial.loc[missing_volume_indices, 'volume'] = np.nan


print("Synthetic financial dataset created:")
display(df_financial.head())
display(df_financial.tail())
display(df_financial.isnull().sum())


# 2. Handle missing values in stock price and volume.
# For simplicity, fill with the mean.
df_financial['stock_price'] = df_financial['stock_price'].fillna(df_financial['stock_price'].mean())
df_financial['volume'] = df_financial['volume'].fillna(df_financial['volume'].mean())

print("\nMissing values after handling:")
display(df_financial.isnull().sum())


# 3. Create new features such as moving average (7-day, 30-day).
# Ensure the data is sorted by date for correct moving average calculation
df_financial = df_financial.sort_values(by='date')

df_financial['stock_price_ma_7'] = df_financial['stock_price'].rolling(window=7).mean()
df_financial['stock_price_ma_30'] = df_financial['stock_price'].rolling(window=30).mean()

# Fill initial NaN values created by rolling window with the stock price itself
df_financial['stock_price_ma_7'] = df_financial['stock_price_ma_7'].fillna(df_financial['stock_price'])
df_financial['stock_price_ma_30'] = df_financial['stock_price_ma_30'].fillna(df_financial['stock_price'])


print("\nNew features (Moving Averages) created:")
display(df_financial[['date', 'stock_price', 'stock_price_ma_7', 'stock_price_ma_30']].head())


# 4. Standardize continuous variables (zero mean, unit variance) using StandardScaler.
continuous_cols = ['stock_price', 'volume', 'stock_price_ma_7', 'stock_price_ma_30']
scaler = StandardScaler()

# Fit and transform the continuous columns
df_financial[continuous_cols] = scaler.fit_transform(df_financial[continuous_cols])

print("\nContinuous variables normalized using StandardScaler:")
display(df_financial.head())


# 5. Encode categorical columns (sector, company_name).
# Using one-hot encoding for simplicity
df_financial = pd.get_dummies(df_financial, columns=['sector', 'company_name'], dummy_na=False)

print("\nCategorical variables encoded:")
display(df_financial.head())


# 6. Display the feature-engineered DataFrame
print("\nFeature-engineered DataFrame:")
display(df_financial.head())

# 7. Finish task
print("\nFinancial dataset preprocessing and feature engineering completed. The DataFrame 'df_financial' is ready for ML tasks.")

Synthetic financial dataset created:


,date,company_name,sector,stock_price,volume
0,2023-01-01,CompanyB,Healthcare,100.471166,93778.661339
1,2023-01-02,CompanyB,Healthcare,61.883548,26065.14806
2,2023-01-03,CompanyC,Tech,100.434669,
3,2023-01-04,CompanyC,Finance,128.30483,23210.722743
4,2023-01-05,CompanyA,Tech,99.785353,83478.182378


,date,company_name,sector,stock_price,volume
45,2023-02-15,CompanyC,Tech,,
46,2023-02-16,CompanyB,Finance,144.613586,
47,2023-02-17,CompanyC,Healthcare,67.944457,38833.147807
48,2023-02-18,CompanyB,Tech,131.791904,89013.639092
49,2023-02-19,CompanyC,Tech,64.732447,45301.697203


,0
date,0
company_name,0
sector,0
stock_price,5
volume,5



Missing values after handling:


,0
date,0
company_name,0
sector,0
stock_price,0
volume,0



New features (Moving Averages) created:


,date,stock_price,stock_price_ma_7,stock_price_ma_30
0,2023-01-01,100.471166,100.471166,100.471166
1,2023-01-02,61.883548,61.883548,61.883548
2,2023-01-03,100.434669,100.434669,100.434669
3,2023-01-04,128.30483,128.30483,128.30483
4,2023-01-05,99.785353,99.785353,99.785353



Continuous variables normalized using StandardScaler:


,date,company_name,sector,stock_price,volume,stock_price_ma_7,stock_price_ma_30
0,2023-01-01,CompanyB,Healthcare,-0.063072,1.226488,-0.094316,-0.087029
1,2023-01-02,CompanyB,Healthcare,-1.376987,-1.30423,-2.9481,-1.840458
2,2023-01-03,CompanyC,Tech,-0.064315,0.0,-0.097015,-0.088688
3,2023-01-04,CompanyC,Finance,0.884669,-1.410911,1.96415,1.177738
4,2023-01-05,CompanyA,Tech,-0.086424,0.841519,-0.145036,-0.118193



Categorical variables encoded:


,date,stock_price,volume,stock_price_ma_7,stock_price_ma_30,sector_Finance,sector_Healthcare,sector_Tech,company_name_CompanyA,company_name_CompanyB,company_name_CompanyC
0,2023-01-01,-0.063072,1.226488,-0.094316,-0.087029,False,True,False,False,True,False
1,2023-01-02,-1.376987,-1.30423,-2.9481,-1.840458,False,True,False,False,True,False
2,2023-01-03,-0.064315,0.0,-0.097015,-0.088688,False,False,True,False,False,True
3,2023-01-04,0.884669,-1.410911,1.96415,1.177738,True,False,False,False,False,True
4,2023-01-05,-0.086424,0.841519,-0.145036,-0.118193,False,False,True,True,False,False



Feature-engineered DataFrame:


,date,stock_price,volume,stock_price_ma_7,stock_price_ma_30,sector_Finance,sector_Healthcare,sector_Tech,company_name_CompanyA,company_name_CompanyB,company_name_CompanyC
0,2023-01-01,-0.063072,1.226488,-0.094316,-0.087029,False,True,False,False,True,False
1,2023-01-02,-1.376987,-1.30423,-2.9481,-1.840458,False,True,False,False,True,False
2,2023-01-03,-0.064315,0.0,-0.097015,-0.088688,False,False,True,False,False,True
3,2023-01-04,0.884669,-1.410911,1.96415,1.177738,True,False,False,False,False,True
4,2023-01-05,-0.086424,0.841519,-0.145036,-0.118193,False,False,True,True,False,False



Financial dataset preprocessing and feature engineering completed. The DataFrame 'df_financial' is ready for ML tasks.
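
In the moving-average output above, the first rows of `stock_price_ma_7` and `stock_price_ma_30` simply repeat `stock_price`, because the incomplete windows were filled with the raw price. One possible alternative, sketched below on a hypothetical four-value series with a window of 3, is the `min_periods` argument of `rolling`, which averages whatever rows are available so far.

```python
import pandas as pd

# Hypothetical short price series, chosen so the averages are easy to check by hand
prices = pd.Series([10.0, 20.0, 30.0, 40.0])

# Default behaviour: the first window-1 rows are NaN because the window is incomplete
ma_strict = prices.rolling(window=3).mean()

# min_periods=1 averages the rows seen so far, so no NaN values are produced
ma_partial = prices.rolling(window=3, min_periods=1).mean()

print(ma_strict.tolist())
print(ma_partial.tolist())
```

With `min_periods=1` the early values are averages over shorter windows, which changes their meaning slightly; whether that is preferable to backfilling with the raw price depends on how the feature is used.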
