### Question 1:
Generate a dataset for linear regression with 1000 samples, 5 features and single target.

Visualize the data by plotting the target column against each feature column. Also plot the best fit line in each case.

Hint: look up how to obtain a regression line with numpy (for example, `np.polyfit`).

In [None]:
# importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# generating data using make_regression of sklearn
X, y = make_regression(n_samples = 1000, n_features = 5, n_targets = 1, noise = 0)
# creating a figure with one subplot per feature
fig, ax = plt.subplots(5,1, figsize = (10,10))

# running a for loop to generate all 5 scatter plots
for i in range(5):
    ax[i].scatter(X[:, i], y, color = 'b')
    ax[i].set_xlabel(f'Feature {i+1}')
    ax[i].set_ylabel('Target')
    
    # for best fit line in each case
    a, b = np.polyfit(X[:, i], y, deg = 1)
    
    ax[i].axline(xy1 = (0, b), slope = a, color = 'r', label = f'y = {a:.2f}x + {b:.2f}')
    ax[i].legend()

# show the plot
plt.tight_layout()
plt.show()
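
To see what `np.polyfit` returns with `deg = 1`, here is a minimal sketch that fits points lying exactly on a known line. The data and the names `x_demo` and `y_demo` are made up for illustration only; the coefficients come back highest degree first, so the slope precedes the intercept.

```python
import numpy as np

# Illustrative points that lie exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

# A degree-1 fit returns (slope, intercept), highest degree first
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With noisy data the recovered values would only approximate the true slope and intercept.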

### Question 2:
Make a classification dataset of 1000 samples with 2 features, 2 classes and 2 clusters per class.
Plot the data.

In [None]:
# importing necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# generating classification dataset using make_classification from sklearn
X,y= make_classification(n_samples = 1000, n_features = 2, n_classes = 2, n_clusters_per_class = 2,  n_informative = 2, n_redundant = 0)

# plotting data on scatter plot
plt.scatter(X[:, 0],X[:,1], c=y)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
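
The call above sets `n_informative = 2` and `n_redundant = 0` because `make_classification` requires `n_classes * n_clusters_per_class` to be at most `2 ** n_informative`, and the informative, redundant and repeated features together cannot exceed `n_features`. The sketch below probes the first constraint; the helper name `can_generate` is hypothetical and exists only for this illustration.

```python
from sklearn.datasets import make_classification

def can_generate(n_informative):
    """Return True if 2 classes x 2 clusters can be built with this n_informative."""
    try:
        make_classification(n_samples=100, n_features=2, n_classes=2,
                            n_clusters_per_class=2, n_informative=n_informative,
                            n_redundant=0, random_state=0)
        return True
    except ValueError:
        # raised when there are too few informative features for the clusters
        return False

print(can_generate(1))  # 4 clusters > 2**1
print(can_generate(2))  # 4 clusters <= 2**2
```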

### Question 3:
Make a clustering dataset with 2 features and 4 clusters.

In [None]:
# importing necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generating clustering dataset using make_blobs of sklearn
X,y = make_blobs(n_samples = 1000, centers = 4, n_features = 2)

# plotting dataset on a scatter plot
plt.scatter(X[:, 0], X[:,1], c = y)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
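
As a quick sanity check, the cluster sizes can be counted with `np.bincount`. This is a small sketch, not part of the required answer; `random_state = 0` and the `_demo` names are assumptions added for reproducibility. When `n_samples` is an integer, `make_blobs` splits the samples evenly across the centers.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as the solution, with a fixed seed for reproducibility
X_demo, y_demo = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)

# Count how many samples fall in each of the 4 clusters
counts = np.bincount(y_demo)
print(counts)
```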

### Question 4:
Go to https://www.worldometers.info/coronavirus/ and scrape the table containing COVID-19 infection and death data using requests and BeautifulSoup. Convert the table to a pandas DataFrame with the following columns: Country, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases.

(<b>Optional Challenge:</b> Change the data type of the columns from Population through ActiveCases to integer. To do that, remove the commas and plus signs; `df.apply()` and `pd.to_numeric()` may help. Take care of values that are empty strings.)

In [None]:
# importing necessary libraries
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

# setting up url, getting content from that url and storing it in the page variable
url = "https://www.worldometers.info/coronavirus"
page = rq.get(url, timeout=30)
# stop early with a clear error if the request failed
page.raise_for_status()

# parsing data to soup
soup = BeautifulSoup(page.text, "html.parser")

# storing table from soup
table = soup.find("table", id = "main_table_countries_today")

# collecting rows from table
rows = table.find_all("tr")

# storing header of table
header = [th.text for th in rows[0].find_all("th")]

# storing all the data of table in data variable using for loop
data = []
for row in rows[1:]:
    data.append([td.text for td in row.find_all("td")])

# store this data in pandas DataFrame
df = pd.DataFrame(data, columns = header)

df = df[["Country,Other", "Continent", "Population", "TotalCases", "NewCases", "TotalDeaths", "NewDeaths", "TotalRecovered", "NewRecovered", "ActiveCases"]]

df.columns = ["Country", "Continent", "Population", "TotalCases", "NewCases", "TotalDeaths", "NewDeaths","TotalRecovered", "NewRecovered", "ActiveCases"]

# printing data
print(df)
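
For the optional challenge, one possible approach is sketched below on a tiny hand-written frame rather than the live table. The rows in `sample` are hypothetical strings that only mimic the scraped format, and the helper `to_int_column` is an illustrative name, not part of pandas. Empty or non-numeric cells become missing values, which the nullable `Int64` dtype can hold.

```python
import pandas as pd

# Hypothetical rows that mimic the scraped strings (not real data)
sample = pd.DataFrame({
    "Country": ["A", "B"],
    "Population": ["1,000,000", " "],
    "NewCases": ["+1,234", ""],
})

def to_int_column(col):
    """Strip commas, plus signs and whitespace, then convert to nullable integers."""
    stripped = col.str.replace(r"[,+\s]", "", regex=True)
    # errors="coerce" turns empty or non-numeric strings into missing values
    return pd.to_numeric(stripped, errors="coerce").astype("Int64")

numeric_cols = ["Population", "NewCases"]
cleaned = sample[numeric_cols].apply(to_int_column)
print(cleaned)
print(cleaned.dtypes)
```

The same helper could be applied to the Population through ActiveCases columns of `df`, assuming the scraped cells follow this format.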

### Question 5:

Generate an imbalanced classification dataset using sklearn with 1000 samples, 2 features, 2 classes and 1 cluster per class. One of the classes should contain only 5% of the total samples. Confirm this using either numpy or Counter. Plot the data.

Now oversample the minority class to 5 times its initial size using SMOTE. Verify the number. Plot the data.

Now undersample the majority class to 3 times the size of the minority class using RandomUnderSampler. Verify the number. Plot the data.

Reference: last markdown cell of the examples.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

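# passing a list of per-cluster sample counts keeps each label tied to its own cluster:
# 950 samples for class 0 and 50 samples (5%) for class 1
X, y = make_blobs(n_samples=[950, 50], n_features=2, cluster_std=1.0, random_state=42)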

print("Class distribution using numpy:")
print(np.bincount(y))

print("Class distribution using Counter:")
print(Counter(y))

plt.scatter(X[:,0], X[:,1], c=y, cmap='rainbow')
plt.title("Imbalanced classification dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

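# oversample the minority class (label 1) to 5 times its initial size;
# a dict sets the exact target count, whereas a float is a minority/majority ratio
sm = SMOTE(sampling_strategy={1: 5 * int(np.bincount(y)[1])}, random_state=42)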
X_sm, y_sm = sm.fit_resample(X, y)

print("Number of samples after oversampling:")
print(X_sm.shape[0])

print("Class distribution after oversampling using numpy:")
print(np.bincount(y_sm))

print("Class distribution after oversampling using Counter:")
print(Counter(y_sm))

plt.scatter(X_sm[:,0], X_sm[:,1], c=y_sm, cmap='rainbow')
plt.title("Oversampled classification dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

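# undersample the majority class (label 0) to 3 times the minority size;
# as a float this would be the minority/majority ratio 1/3, not 1.5, so set the count directly
rus = RandomUnderSampler(sampling_strategy={0: 3 * int(np.bincount(y_sm)[1])}, random_state=42)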
X_rus, y_rus = rus.fit_resample(X_sm, y_sm)

print("Number of samples after undersampling:")
print(X_rus.shape[0])

print("Class distribution after undersampling using numpy:")
print(np.bincount(y_rus))

print("Class distribution after undersampling using Counter:")
print(Counter(y_rus))

plt.scatter(X_rus[:,0], X_rus[:,1], c=y_rus, cmap='rainbow')
plt.title("Undersampled classification dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
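
The target sizes in this question follow from simple arithmetic on the class counts, sketched below. The variable names are illustrative. When `sampling_strategy` is given as a float, imbalanced-learn interprets it as the ratio of minority to majority samples after resampling, so the equivalent ratios are shown as well.

```python
# Class counts for a 5% minority in 1000 samples
n_majority, n_minority = 950, 50

# Oversampling: minority grows to 5 times its initial size
target_minority = 5 * n_minority
smote_ratio = target_minority / n_majority   # minority / majority after SMOTE

# Undersampling: majority shrinks to 3 times the (new) minority size
target_majority = 3 * target_minority
rus_ratio = target_minority / target_majority  # minority / majority after undersampling

print(target_minority, round(smote_ratio, 3))
print(target_majority, round(rus_ratio, 3))
```

Passing exact counts in a dict avoids the rounding that a float ratio can introduce.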

### Question 6:

Write Python code to perform data preprocessing on a dataset using the scikit-learn library. Follow the instructions below:

 * Load the dataset using the scikit-learn `load_iris` function.
 * Assign the feature data to a variable named `X` and the target data to a variable named `y`.
 * Create a pandas DataFrame called `df` using `X` as the data and the feature names obtained from the dataset.
 * Display the first 5 rows of the DataFrame `df`.
 * Check if there are any missing values in the DataFrame and handle them accordingly.
 * Split the data into training and testing sets using the `train_test_split` function from scikit-learn. Assign 70% of the data to the training set and the remaining 30% to the testing set.
 * Print the dimensions of the training set and testing set respectively.
 * Standardize the feature data in the training set using the `StandardScaler` from scikit-learn.
 * Apply the same scaling transformation on the testing set.
 * Print the first 5 rows of the standardized training set.
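
The iris data has no missing values, so the "handle them accordingly" step has nothing to do there. For a dataset that does have gaps, one common option is mean imputation, sketched below on a made-up frame (`df_demo` and its values are hypothetical). Dropping the affected rows with `dropna()` would be another reasonable choice.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in column "a"
df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Count missing values per column
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Fill gaps with the column mean only if something is missing
if missing_per_column.any():
    df_demo = df_demo.fillna(df_demo.mean())

print(df_demo)
```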

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()

# Assign feature data and target data
X = iris.data
y = iris.target

# Create a pandas DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)

# Display the first 5 rows of the DataFrame
print(df.head())

# Check for missing values
print("Missing Values:", df.isnull().sum())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the dimensions of the training and testing sets
print("Training Set Dimensions:", X_train.shape)
print("Testing Set Dimensions:", X_test.shape)

# Standardize the feature data in the training set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same scaling transformation on the testing set
X_test_scaled = scaler.transform(X_test)

# Print the first 5 rows of the standardized training set
df_train_scaled = pd.DataFrame(X_train_scaled, columns=iris.feature_names)
print(df_train_scaled.head())
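
A short check, sketched here with separate `_demo`-style names so it does not overwrite the variables above, shows why the scaler is fitted on the training set only. The training features end up with mean 0 and standard deviation 1, while the test features are only close to that because they reuse the training statistics.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# Fit on the training set only, then reuse the same statistics for the test set
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print("train mean:", np.round(X_tr_s.mean(axis=0), 6))
print("train std: ", np.round(X_tr_s.std(axis=0), 6))
print("test mean: ", np.round(X_te_s.mean(axis=0), 3))
```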