### Question 1:
Generate a dataset for linear regression with 1000 samples, 5 features and a single target.

Visualize the data by plotting the target column against each feature column. Also plot the best-fit line in each case.

Hint: search for how to obtain a regression line using NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Generate random data
num_samples = 1000
num_features = 5

# Generate features (X) from a random normal distribution
X = np.random.randn(num_samples, num_features)

# Generate target values (y) from a linear combination of features with some noise
true_coefficients = np.array([2, -1, 0.5, 1.5, -0.5])
noise = np.random.randn(num_samples) * 0.5
y = np.dot(X, true_coefficients) + noise

# Visualize the data
fig, axs = plt.subplots(num_features, 1, figsize=(6, 20))

for i in range(num_features):
    ax = axs[i]
    ax.scatter(X[:, i], y, s=5, alpha=0.5)
    ax.set_xlabel('Feature {}'.format(i+1))
    ax.set_ylabel('Target')
    ax.set_title('Feature {} vs Target'.format(i+1))

    # Calculate best fit line
    coefficients = np.polyfit(X[:, i], y, 1)
    best_fit_line = np.polyval(coefficients, X[:, i])
    
    # Plot best fit line
    ax.plot(X[:, i], best_fit_line, color='red')

plt.tight_layout()
plt.show()
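
Each per-feature line above is fitted with `np.polyfit` while the other four features act as extra noise. As a rough cross-check, the sketch below regenerates data of the same shape and compares a single-feature slope with a joint least-squares fit over all five features via `np.linalg.lstsq`. The `default_rng` generator and the variable names are assumptions made for this illustration, so the numbers will differ from the cell above.

```python
import numpy as np

# Regenerate data with the same recipe as the cell above (different RNG, so values differ)
rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 5))
true_coefficients = np.array([2, -1, 0.5, 1.5, -0.5])
y = X @ true_coefficients + rng.standard_normal(1000) * 0.5

# Single-feature fit: np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(X[:, 0], y, 1)

# Joint fit: append a column of ones so the last solution entry is the intercept
design = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
joint_coefficients, joint_intercept = solution[:-1], solution[-1]

print('Slope for feature 1 alone:', slope)
print('Joint coefficients:', joint_coefficients)
print('Joint intercept:', joint_intercept)
```

Because the features are drawn independently, the single-feature slope should land near the corresponding true coefficient, only with more scatter than the joint estimate.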

### Question 2:
Make a classification dataset of 1000 samples with 2 features, 2 classes and 2 clusters per class.
Plot the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate classification dataset
num_samples = 1000
num_features = 2
num_classes = 2
num_clusters_per_class = 2

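# make_blobs labels each of the four clusters separately (0..3),
# so fold the cluster labels into two classes with two clusters each
X, cluster_labels = make_blobs(n_samples=num_samples, n_features=num_features, centers=num_classes * num_clusters_per_class,
                  cluster_std=1.0, center_box=(-10.0, 10.0), random_state=42)
y = cluster_labels % num_classes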

# Plot the data
plt.figure(figsize=(8, 6))
for class_label in range(num_classes):
    class_samples = X[y == class_label]
    plt.scatter(class_samples[:, 0], class_samples[:, 1], label='Class {}'.format(class_label))

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Classification Dataset')
plt.legend()
plt.grid(True)
plt.show()
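
An alternative sketch, assuming `make_classification` is acceptable for this question: its `n_clusters_per_class` parameter builds two clusters for each class directly, so no relabelling of clusters is needed. Plotting is left out here; the scatter loop above works unchanged on the resulting `X` and `y`.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Two informative features, no redundant ones, two clusters for each of the two classes
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, random_state=42)

# Check the shape and the class balance
print('Shape of X:', X.shape)
print('Class distribution:', Counter(y))
```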

### Question 3:
Make a clustering dataset with 2 features and 4 clusters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Set random seed for reproducibility
np.random.seed(42)

# Generate clustering dataset
num_samples = 1000
num_features = 2
num_clusters = 4

X, y = make_blobs(n_samples=num_samples, n_features=num_features, centers=num_clusters,
                  cluster_std=1.0, center_box=(-10.0, 10.0), random_state=42)

# Plot the data
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Clustering Dataset')
plt.grid(True)
plt.colorbar()
plt.show()

## Question 4
Go to the website https://www.worldometers.info/coronavirus/ and scrape the table containing COVID-19 infection and death data using requests and BeautifulSoup. Convert the table to a Pandas dataframe with the following columns: Country, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases.

*(<b>Optional Challenge:</b> Change the data type of the columns (Population through ActiveCases) to integer. To do that, you need to remove the commas and plus signs. You may need to use df.apply() and pd.to_numeric(). Take care of values that are empty strings.)*

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the website
url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table element with class 'table' and extract the table data
table = soup.find('table', class_='table')
table_rows = table.find_all('tr')

# Extract the table headers (column names)
headers = []
for th in table_rows[0].find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows (data rows)
rows = []
for row in table_rows[1:]:
    data = []
    for td in row.find_all('td'):
        data.append(td.text.strip())
    rows.append(data)

# Create a Pandas dataframe from the extracted table data
df = pd.DataFrame(rows, columns=headers)
#print(df.columns)

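# Select the desired columns
columns_to_keep = ['Country,Other', 'Continent', 'Population', 'TotalCases', 'NewCases',
                  'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'NewRecovered', 'ActiveCases']
df = df[columns_to_keep]

# Rename 'Country,Other' to 'Country' to match the column names requested in the question
df = df.rename(columns={'Country,Other': 'Country'})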

# Print the resulting dataframe
print(df)
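
For the optional challenge, one possible approach is sketched below. It uses `df.apply()` to strip commas and plus signs and `pd.to_numeric()` with `errors='coerce'` so that empty strings become missing values, then stores the result in pandas' nullable `Int64` dtype. The helper name `to_integer_columns` and the three sample rows are made up for illustration; on the scraped dataframe the same call would be applied to the columns Population through ActiveCases.

```python
import pandas as pd

def to_integer_columns(frame, columns):
    """Return a copy of frame with the given string columns converted to nullable integers."""
    def clean(column):
        # Remove thousands separators and leading plus signs, then parse;
        # anything unparseable (such as an empty string) becomes NaN
        stripped = column.astype(str).str.replace(r'[,+]', '', regex=True).str.strip()
        return pd.to_numeric(stripped, errors='coerce')

    converted = frame[columns].apply(clean).astype('Int64')
    # assign() replaces the columns in place of the originals and keeps the column order
    return frame.assign(**{name: converted[name] for name in columns})

# Made-up rows that imitate the formatting of the scraped table
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Population': ['1,000', '2,500,000', ''],
    'NewCases': ['+25', '', '+1,234'],
})

result = to_integer_columns(sample, ['Population', 'NewCases'])
print(result)
print(result.dtypes)
```

The nullable `Int64` dtype is used because a plain NumPy integer column cannot hold missing values.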

# Question 5

Generate an imbalanced classification dataset using sklearn with 1000 samples, 2 features, 2 classes and 1 cluster per class. One of the classes should contain only 5% of the total samples. Confirm this using either numpy or Counter. Plot the data.

Now oversample the minority class to 5 times its initial size using SMOTE. Verify the number. Plot the data.

Now undersample the majority class to 3 times the size of minority class using RandomUnderSampler. Verify the number. Plot the data.

Reference : Last markdown cell of the examples.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Set random seed for reproducibility
np.random.seed(42)

# Generate imbalanced classification dataset
num_samples = 1000
num_features = 2
num_clusters_per_class = 1
class_ratio = 0.05

X, y = make_classification(n_samples=num_samples, n_features=num_features, n_informative=num_features,
                           n_redundant=0, n_clusters_per_class=num_clusters_per_class,
                           weights=[class_ratio, 1-class_ratio], flip_y=0, random_state=42)

# Plot the original data
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0, 0], X[y == 0, 1], label='Class 0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Imbalanced Classification Dataset (Original)')
plt.legend()
plt.grid(True)
plt.show()

# Confirm class imbalance using Counter
print('Class distribution (Original):', Counter(y))

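# Oversample the minority class to 5 times its initial size using SMOTE.
# A dict sampling_strategy sets the target number of samples for the named class;
# the previous float value of 0.2 gave minority/majority = 0.2 instead of a 5x increase.
class_counts = Counter(y)
minority_class = min(class_counts, key=class_counts.get)
oversampler = SMOTE(sampling_strategy={minority_class: 5 * class_counts[minority_class]}, random_state=42)
X_oversampled, y_oversampled = oversampler.fit_resample(X, y)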

# Plot the oversampled data
plt.figure(figsize=(8, 6))
plt.scatter(X_oversampled[y_oversampled == 0, 0], X_oversampled[y_oversampled == 0, 1], label='Class 0')
plt.scatter(X_oversampled[y_oversampled == 1, 0], X_oversampled[y_oversampled == 1, 1], label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Imbalanced Classification Dataset (Oversampled)')
plt.legend()
plt.grid(True)
plt.show()

# Confirm class imbalance after oversampling
print('Class distribution (Oversampled):', Counter(y_oversampled))

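# Undersample the majority class to 3 times the size of the minority class.
# A float sampling_strategy is the minority/majority ratio after resampling,
# so 1/3 leaves the majority class with 3 times as many samples as the minority class.
undersampler = RandomUnderSampler(sampling_strategy=1/3, random_state=42)
X_undersampled, y_undersampled = undersampler.fit_resample(X, y)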

# Plot the undersampled data
plt.figure(figsize=(8, 6))
plt.scatter(X_undersampled[y_undersampled == 0, 0], X_undersampled[y_undersampled == 0, 1], label='Class 0')
plt.scatter(X_undersampled[y_undersampled == 1, 0], X_undersampled[y_undersampled == 1, 1], label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Imbalanced Classification Dataset (Undersampled)')
plt.legend()
plt.grid(True)
plt.show()

# Confirm class imbalance after undersampling
print('Class distribution (Undersampled):', Counter(y_undersampled))
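
The question also allows confirming the class split with numpy instead of `Counter`. A minimal sketch, regenerating the same dataset so that it runs on its own, uses `np.unique` with `return_counts=True`:

```python
import numpy as np
from sklearn.datasets import make_classification

# Same generator settings as the cell above
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.05, 0.95], flip_y=0, random_state=42)

# Count the samples per class label and convert the counts to fractions
labels, counts = np.unique(y, return_counts=True)
fractions = counts / counts.sum()

for label, count, fraction in zip(labels, counts, fractions):
    print('Class {}: {} samples ({:.1%})'.format(label, count, fraction))
```

The same two lines can be pointed at `y_oversampled` or `y_undersampled` to verify the resampled counts.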

## Question 6

Write a Python code to perform data preprocessing on a dataset using the scikit-learn library. Follow the instructions below:

 * Load the dataset using the scikit-learn `load_iris` function.
 * Assign the feature data to a variable named `X` and the target data to a variable named `y`.
 * Create a pandas DataFrame called `df` using `X` as the data and the feature names obtained from the dataset.
 * Display the first 5 rows of the DataFrame `df`.
 * Check if there are any missing values in the DataFrame and handle them accordingly.
 * Split the data into training and testing sets using the `train_test_split` function from scikit-learn. Assign 70% of the data to the training set and the remaining 30% to the testing set.
 * Print the dimensions of the training set and testing set respectively.
 * Standardize the feature data in the training set using the `StandardScaler` from scikit-learn.
 * Apply the same scaling transformation on the testing set.
 * Print the first 5 rows of the standardized training set.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()

# Assign the feature data to X and target data to y
X = iris.data
y = iris.target

# Create a pandas DataFrame using X and feature names
df = pd.DataFrame(X, columns=iris.feature_names)

# Display the first 5 rows of the DataFrame
print(df.head())

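# Check for missing values; the Iris dataset contains none,
# so every count is zero and no imputation or row removal is needed
print(df.isnull().sum())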

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the dimensions of the training and testing sets
print("Training set dimensions:", X_train.shape)
print("Testing set dimensions:", X_test.shape)

# Standardize the feature data in the training set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same scaling transformation on the testing set
X_test_scaled = scaler.transform(X_test)

# Print the first 5 rows of the standardized training set
print(pd.DataFrame(X_train_scaled, columns=iris.feature_names).head())
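
As a quick sanity check on the scaling step, the sketch below repeats the split and confirms that the standardized training features have mean close to 0 and standard deviation close to 1. The test set is transformed with the training statistics, so its means are only approximately 0. The variable names `train_means`, `train_stds` and `test_means` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same split as the cell above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training set only, then reuse it for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print('Training means:', np.round(train_means, 6))
print('Training standard deviations:', np.round(train_stds, 6))
print('Testing means:', np.round(test_means, 3))
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.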