### 1. Import and understand the data.
### A. Import ‘signal-data.csv’ as DataFrame.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('Downloads/signal-data.csv')
df

The file 'signal-data.csv' is imported and stored in the variable df.

### B. Print 5 point summary and share at least 2 observations.

In [None]:
df.describe()

At least 2 observations:
    The Pass/Fail mean of -0.867262 is close to -1, the value that encodes "pass" (1 encodes "fail"). Most observations therefore fall in the "pass" class rather than the "fail" class.
    The standard deviation of 0.498010 does not indicate tightly clustered values. For a variable that takes only the values -1 and 1, the spread is fixed by the mean, so this figure reflects the same class imbalance: a large "pass" majority and a small "fail" minority.
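Because the target takes only the values -1 and 1, the class shares can be recovered from the mean alone. The following is a minimal sketch of that arithmetic, assuming the statistics quoted above belong to the Pass/Fail column; the variable names are illustrative.

```python
import math

# Summary statistic quoted in the observations above (assumed to be Pass/Fail).
mean = -0.867262

# For a variable coded -1 (pass) / 1 (fail):
#   mean = p_fail * 1 + (1 - p_fail) * (-1)  =>  p_fail = (1 + mean) / 2
fail_share = (1 + mean) / 2
pass_share = 1 - fail_share

# The population standard deviation of a -1/1 variable depends only on the mean.
implied_std = math.sqrt(1 - mean ** 2)

print(f"fail share:  {fail_share:.4f}")
print(f"pass share:  {pass_share:.4f}")
print(f"implied std: {implied_std:.4f}")
```

The implied value lands close to the 0.498010 reported by `describe()`, which uses the sample (n - 1) standard deviation and is therefore slightly larger.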

### 2. Data cleansing:
### A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.

In [None]:
df.dtypes

In [None]:
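# Non-numeric entries become NaN; a 'Time' column of timestamp strings turns entirely NaN and is dropped by the 20% null filter below
df['Time'] = pd.to_numeric(df['Time'], errors='coerce')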

In [None]:
df.info()

In [None]:
blank_spaces = (df == '').sum()

# Check if any blank spaces exist
if blank_spaces.sum() > 0:
    print("Blank spaces exist in the DataFrame.")
    # Print the count of blank spaces for each column
    print(blank_spaces)
else:
    print("No blank spaces found in the DataFrame.")

In [None]:
null_values = df.isnull().sum()

# Check if any null values exist
if null_values.sum() > 0:
    print("Null values exist in the DataFrame.")
    # Print the count of null values for each column
    print(null_values)
else:
    print("No null values found in the DataFrame.")

In [None]:
# Calculate the percentage of null values for each feature
null_percentages = df.isnull().mean() * 100

# Create an empty list to store the features to be removed
features_to_remove = []

# Iterate over each feature and check if it has 20% or more null values
for feature in df.columns:
    if null_percentages[feature] >= 20:
        features_to_remove.append(feature)
    else:
        # Impute the missing values with the mean of the feature
        # (assign back instead of chained inplace fillna, which pandas deprecates)
        df[feature] = df[feature].fillna(df[feature].mean())

# Remove the features with 20% or more null values from the DataFrame
df.drop(features_to_remove, axis=1, inplace=True)

# Check if the query is met
if len(features_to_remove) > 0:
    print("Features with 20% or more null values have been removed.")
else:
    print("No features with 20% or more null values found.")


In [None]:
df.info()

In [None]:
null_values = df.isnull().sum()

# Check if any null values exist
if null_values.sum() > 0:
    print("Null values exist in the DataFrame.")
    # Print the count of null values for each column
    print(null_values)
else:
    print("No null values found in the DataFrame.")
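The threshold-and-impute logic can be checked on a small frame where the answer is known in advance. The sketch below uses a made-up three-column DataFrame (the names `toy`, `a`, `b`, `c` are hypothetical) and a loop-free equivalent of the cleaning cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' has no nulls, 'b' has 10% nulls, 'c' has 30% nulls.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [2.0, np.nan, 2.0, 4.0, 4.0, 2.0, 4.0, 2.0, 4.0, 3.0],
    "c": [np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Percentage of missing values per column.
null_pct = toy.isnull().mean() * 100

# Keep columns below the 20% threshold, then fill the remaining gaps
# with each column's own mean.
keep = null_pct[null_pct < 20].index
cleaned = toy[keep].fillna(toy[keep].mean())

print(null_pct)
print(cleaned)
```

Column `c` is removed, and the single gap in `b` is filled with the mean of its nine observed values.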

### B. Identify and drop the features which are having same value for all the rows.

In [None]:
# Calculate the number of unique values for each feature
unique_counts = df.nunique()

# Identify features with only one unique value
features_to_drop = unique_counts[unique_counts == 1].index

# Drop the features with only one unique value from the DataFrame
df.drop(features_to_drop, axis=1, inplace=True)

# Print the list of dropped features
print("Dropped features:", features_to_drop.tolist())


### C. Drop other features if required using relevant functional knowledge. Clearly justify the same. 

In [None]:
df.head()

No other features need to be dropped, as the remaining features carry information that is relevant to the analysis.

### D. Check for multi-collinearity in the data and take necessary action.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the correlation matrix
corr_matrix = df.corr()

# Print the correlation matrix
print("Correlation Matrix:")
print(corr_matrix)

# Calculate the VIF for each predictor (the target 'Pass/Fail' is excluded
# so that it can never be dropped as a "collinear feature")
features = df.drop(columns='Pass/Fail')
vif = pd.DataFrame()
vif["Feature"] = features.columns
vif["VIF"] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]

# Print the VIF results
print("Variance Inflation Factors (VIF):")
print(vif)

# Drop features with high VIF
threshold = 10  # Adjust this threshold as needed
high_vif_features = vif[vif["VIF"] > threshold]["Feature"]
df = df.drop(high_vif_features, axis=1)

# Print the modified dataset
print("Modified Dataset:")
print(df.head())


In [None]:
df

### E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.

First, we compute the correlation matrix using the corr() function on the dataset. The correlation matrix shows the pairwise correlations between all the features.

Next, we calculate the VIF for each feature using the variance_inflation_factor() function from the statsmodels library. The VIF of a feature measures how much the variance of its estimated regression coefficient is inflated by its correlation with the other features.

We create a DataFrame vif to store the feature names and their corresponding VIF values.

Then, we print the correlation matrix and the VIF results.

Based on the VIF values, we set a threshold of 10 to flag features with high multicollinearity. The code collects the features whose VIF exceeds the threshold in the high_vif_features variable.

Finally, we drop the features with high VIF from the dataset using the drop() function and store the modified dataset in the df variable.
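For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features. Below is a minimal NumPy sketch of that definition on synthetic data; the helper name `vif_from_r2` and the generated columns are illustrative and not part of the notebook. Unlike a bare call to `variance_inflation_factor`, this sketch adds an intercept to each auxiliary regression.

```python
import numpy as np

def vif_from_r2(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)                   # independent of x0
x2 = x0 + rng.normal(scale=0.1, size=500)   # nearly a copy of x0
X = np.column_stack([x0, x1, x2])

vifs = [vif_from_r2(X, i) for i in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns receive a VIF far above the threshold of 10, while the independent column stays close to 1.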

### 3. Data analysis & visualisation:
### A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis. 

In [None]:
import matplotlib.pyplot as plt
# Iterate over each feature in the DataFrame
for feature in df.columns:
    print("Feature:", feature)
    print("------------------------------")

    # Histogram
    plt.figure(figsize=(8, 6))
    plt.hist(df[feature], bins=20)
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.title("Histogram of " + feature)
    plt.show()

    # Box plot
    plt.figure(figsize=(8, 6))
    plt.boxplot(df[feature])
    plt.xlabel(feature)
    plt.title("Box Plot of " + feature)
    plt.show()
    
    
    # Descriptive statistics
    print("Descriptive statistics for", feature)
    print(df[feature].describe())
    print()
    print("------------------------------\n")


### B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis. 

In [None]:
import seaborn as sns
from itertools import combinations

# Bivariate analysis (relationship between two variables)
print("Bivariate Analysis")
print("------------------")

# Visit each unordered pair of features once; nested loops over df.columns
# would analyse every pair twice (A vs B and B vs A)
for feature1, feature2 in combinations(df.columns, 2):
    print("Analysis between", feature1, "and", feature2)
    print("---------------------------------")

    # Scatter plot
    plt.figure(figsize=(8, 6))
    sns.scatterplot(data=df, x=feature1, y=feature2)
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    plt.title("Scatter Plot: " + feature1 + " vs " + feature2)
    plt.show()

    # Correlation coefficient
    correlation_coefficient = df[feature1].corr(df[feature2])
    print("Correlation Coefficient:", correlation_coefficient)
    print()
    print("---------------------------------\n")



In [None]:

# Multivariate analysis (relationship among multiple variables)
print("Multivariate Analysis")
print("----------------------")

# Correlation matrix
correlation_matrix = df.corr()

# Heatmap
plt.figure(figsize=(30, 30))
sns.heatmap(data=correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix Heatmap")
plt.show()


### 4. Data pre-processing
### A. Segregate predictors vs target attributes.

In [None]:
# Segregate predictors and target attributes
predictors = df.drop('Pass/Fail', axis=1)  # Drop the target attribute column
target = df['Pass/Fail']  # Select the target attribute column

# Print the predictors and target attributes
print("Predictors:")
print(predictors.head())
print()

print("Target Attribute:")
print(target.head())


### B. Check for target balancing and fix it if found imbalanced.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# The data is held in the DataFrame 'df'
# and the target attribute is the 'Pass/Fail' column

# Check target balance
target_counts = df['Pass/Fail'].value_counts()
target_balance = target_counts / len(df) * 100

print("Target Balance:")
print(target_balance)
print()

# Plot target distribution
plt.figure(figsize=(8, 6))
target_counts.plot(kind='bar')
plt.xlabel("Target Attribute")
plt.ylabel("Count")
plt.title("Target Distribution")
plt.show()

if target_balance.min() < 5 or target_balance.max() > 95:
    print("Target not balanced")
    # Perform techniques to address target imbalance
    # Example techniques include oversampling, undersampling, or synthetic data generation

    # Apply the chosen technique and update the DataFrame if needed
    # balanced_data = ...  # Perform oversampling, undersampling, or synthetic data generation

    # Update the predictors and target attributes accordingly
    # predictors = balanced_data.drop('Pass/Fail', axis=1)
    # target = balanced_data['Pass/Fail']

    # Print the updated target balance
    # updated_target_counts = balanced_data['Pass/Fail'].value_counts()
    # updated_target_balance = updated_target_counts / len(balanced_data) * 100
    # print("Updated Target Balance:")
    # print(updated_target_balance)
else:
    print("Target balanced")
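One way to fill in the commented placeholder is random oversampling of the minority class. The sketch below is an assumption about how that could look, not the notebook's chosen method. It uses a small made-up frame (the names `toy` and `feature` are hypothetical) and resamples with pandas only.

```python
import pandas as pd

# Hypothetical toy data: 17 "pass" rows (-1) and 3 "fail" rows (1).
toy = pd.DataFrame({
    "feature": range(20),
    "Pass/Fail": [-1] * 17 + [1] * 3,
})

counts = toy["Pass/Fail"].value_counts()
majority = toy[toy["Pass/Fail"] == counts.idxmax()]
minority = toy[toy["Pass/Fail"] == counts.idxmin()]

# Sample minority rows with replacement until both classes have the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
balanced = balanced.reset_index(drop=True)

print(balanced["Pass/Fail"].value_counts())
```

On the real data this would normally be applied to the training split only, so that duplicated rows never appear in the test set. Note also that the 5%/95% rule in the cell above reports any minority share of 5% or more as balanced.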

### C. Perform train-test split and standardise the data or vice versa if required.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=42)

# Standardize the data (if required)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### D. Check if the train and test data have similar statistical characteristics when compared with original data. 

In [None]:
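# Calculate statistical measures for the original predictors
# (the target is excluded so the indexes align with X_train / X_test)
original_mean = predictors.mean()
original_std = predictors.std()
original_corr = predictors.corr()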

# Calculate statistical measures for the train and test data
train_mean = X_train.mean()
train_std = X_train.std()
train_corr = X_train.corr()

test_mean = X_test.mean()
test_std = X_test.std()
test_corr = X_test.corr()

# Compare statistical measures
mean_diff_train = np.abs(original_mean - train_mean)
mean_diff_test = np.abs(original_mean - test_mean)
std_diff_train = np.abs(original_std - train_std)
std_diff_test = np.abs(original_std - test_std)
corr_diff_train = np.abs(original_corr - train_corr)
corr_diff_test = np.abs(original_corr - test_corr)

print("Comparison of Statistical Measures:")
print("------------------------------------")
print("Mean Difference - Train Data:")
print(mean_diff_train)
print()

print("Mean Difference - Test Data:")
print(mean_diff_test)
print()

print("Standard Deviation Difference - Train Data:")
print(std_diff_train)
print()

print("Standard Deviation Difference - Test Data:")
print(std_diff_test)
print()

print("Correlation Difference - Train Data:")
print(corr_diff_train)
print()

print("Correlation Difference - Test Data:")
print(corr_diff_test)


### 5. Model training, testing and tuning:
### A. Use any Supervised Learning technique to train a model. 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=42)

# Initialize Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
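Accuracy alone can look strong on an imbalanced target. As a worked example with hypothetical counts (not taken from the notebook's output), a classifier that always predicts the majority class scores as follows.

```python
# Hypothetical test set: 934 pass (-1) and 66 fail (1) samples.
y_true = [-1] * 934 + [1] * 66

# A "model" that ignores its input and always predicts pass.
y_pred = [-1] * len(y_true)

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Recall per class: the share of each true class that was predicted correctly.
fail_recall = sum(t == 1 and p == 1 for t, p in pairs) / y_true.count(1)
pass_recall = sum(t == -1 and p == -1 for t, p in pairs) / y_true.count(-1)
balanced_accuracy = (fail_recall + pass_recall) / 2

print("accuracy:", accuracy)
print("fail recall:", fail_recall)
print("balanced accuracy:", balanced_accuracy)
```

The high accuracy hides the fact that no failing unit is detected, which is why recall on the "fail" class and balanced accuracy are worth reporting next to plain accuracy.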

### B. Use cross validation techniques.
### Hint: Use all CV techniques that you have learnt in the course.

In [None]:
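from sklearn.model_selection import cross_val_score, KFold
model = RandomForestClassifier()

# Perform plain k-fold cross-validation with k=5
# (an integer cv would silently use stratified folds for a classifier)
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
scores = cross_val_score(model, predictors, target, cv=kf)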

# Print the cross-validation scores
print("k-fold cross-validation")
print("Cross-validation Scores:", scores)
print("Average Score:", scores.mean())

In [None]:
from sklearn.model_selection import StratifiedKFold
k = 5
skf = StratifiedKFold(n_splits=k)
scores = cross_val_score(model, predictors, target, cv=skf)

# Print the cross-validation scores
print("Stratified k-fold cross-validation:")
print("Cross-validation Scores:", scores)
print("Average Score:", scores.mean())

In [None]:
from sklearn.model_selection import RepeatedKFold
# Perform repeated k-fold cross-validation with k=5 and n=3
k = 5
n = 3
rkf = RepeatedKFold(n_splits=k, n_repeats=n)
scores = cross_val_score(model, predictors, target, cv=rkf)

# Print the cross-validation scores
print("Repeated k-fold cross-validation:")
print("Cross-validation Scores:", scores)
print("Average Score:", scores.mean())
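The practical difference between plain and stratified k-fold shows up in how many minority samples land in each test fold. Here is a small sketch on made-up labels (100 samples, 10 of them positive), assuming scikit-learn is installed; the helper name `positives_per_fold` is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

def positives_per_fold(cv):
    """Count positive samples in the test part of every fold."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain k-fold puts every positive sample into a single fold here, while the stratified splitter spreads them evenly. When `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses stratified folds by default.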

### C. Apply hyper-parameter tuning techniques to get the best accuracy

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the hyperparameter grid to search through
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7]
}

# Initialize the Random Forest model
model = RandomForestClassifier()

# Perform grid search cross-validation on the training split only,
# so the held-out test set plays no part in choosing hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and their mean cross-validated accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)


### D. Use any other technique/method which can enhance the model performance.
### Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score

# Initialize the feature selector
selector = SelectKBest(score_func=f_classif, k=10)

# Fit the selector to the training data
selector.fit(X_train, y_train)

# Transform the training and testing data
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Initialize the Random Forest model
model = RandomForestClassifier()

# Train the model on the selected features
model.fit(X_train_selected, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test_selected)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


### E. Display and explain the classification report in detail. 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
# Generate the classification report
report = classification_report(y_test, y_pred)

# Generate the confusion matrix
matrix = confusion_matrix(y_test, y_pred)

# Display the classification report and confusion matrix
print("Classification Report:")
print(report)

print("\nConfusion Matrix:")
print(matrix)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(matrix, annot=True, cmap='Blues', fmt='d', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
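Each row of the classification report can be reproduced from the confusion matrix. The worked example below uses a hypothetical 2x2 matrix (the counts are illustrative, not the notebook's results), with rows as true classes and columns as predicted classes in the order [-1, 1], which is the layout `confusion_matrix` returns.

```python
# Hypothetical confusion matrix: rows = true, columns = predicted, order [-1, 1].
cm = [[280, 10],
      [15, 9]]

tn, fp = cm[0]   # true pass: predicted pass / predicted fail
fn, tp = cm[1]   # true fail: predicted pass / predicted fail

# Metrics for the "fail" class (label 1).
precision = tp / (tp + fp)   # share of predicted fails that really failed
recall = tp / (tp + fn)      # share of real fails that were caught
f1 = 2 * precision * recall / (precision + recall)

# "Support" in the report is the number of true samples of the class.
support = fn + tp
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"support={support} accuracy={accuracy:.3f}")
```

In this illustration the overall accuracy is high even though fewer than half of the failing units are caught, which is the pattern to look for in the "fail" row of the report.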

### F. Apply the above steps for all possible models that you have learnt so far

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Use the signal data prepared above rather than synthetic data.
# XGBClassifier expects class labels 0 and 1, so recode the target:
# -1 (pass) -> 0, 1 (fail) -> 1
X = predictors
y = np.where(target == 1, 1, 0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the models
models = {
    'KNN': KNeighborsClassifier(),
    'SVC': SVC(),
    'XGBoost': XGBClassifier(),
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier()
}

# Train and evaluate each model
results = {'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1-score': []}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results['Model'].append(model_name)
    results['Accuracy'].append(accuracy)
    results['Precision'].append(precision)
    results['Recall'].append(recall)
    results['F1-score'].append(f1)

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Sort the DataFrame based on the evaluation metric of your choice
sorted_results = results_df.sort_values(by='Accuracy', ascending=False)

# Select the best model
best_model = sorted_results.iloc[0]['Model']

# Display the results
print(sorted_results)
print('\nBest Model:', best_model)

# Plot the confusion matrix of the best model as a heatmap
matrix = confusion_matrix(y_test, models[best_model].predict(X_test))
plt.figure(figsize=(8, 6))
sns.heatmap(matrix, annot=True, cmap='Blues', fmt='d', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix: ' + best_model)
plt.show()


### 6. Post Training and Conclusion: 
### A. Display and compare all the models designed with their train and test accuracies.

In [None]:
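# Display the test metrics alongside each fitted model's training accuracy
results_df = pd.DataFrame(results)
results_df['Train Accuracy'] = [models[name].score(X_train, y_train) for name in results_df['Model']]
print(results_df)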

### B. Select the final best trained model along with your detailed comments for selecting this model.

In [None]:
# Reuse the metrics computed in section 5.F instead of retraining every model

# Sort the DataFrame based on the evaluation metric of your choice
sorted_results = results_df.sort_values(by='Accuracy', ascending=False)

# Select the best model (this is the model's name, used as a key into 'models')
best_model = sorted_results.iloc[0]['Model']

# Display the results
print(sorted_results)
print('\nBest Model:', best_model)


### C. Pickle the selected model for future use.

In [None]:
import pickle

# 'best_model' holds the name of the selected model; look up the fitted
# estimator in 'models' so the model itself is pickled, not the name string

# Pickle the selected model
with open('selected_model.pkl', 'wb') as file:
    pickle.dump(models[best_model], file)

print("Model pickled and saved as 'selected_model.pkl'")
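A quick round trip confirms that what was pickled is a usable estimator. The sketch below does this in memory with a small stand-in model on made-up data (`fitted` and `restored` are illustrative names), assuming scikit-learn is installed. Writing to a file with `pickle.dump` and reading back with `pickle.load` behaves the same way.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the selected model, fitted on four toy samples.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
fitted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Serialise to bytes and restore, as a file-based dump/load would.
restored = pickle.loads(pickle.dumps(fitted))

# The restored object should be an estimator that predicts identically.
same = bool((restored.predict(X) == fitted.predict(X)).all())
print(type(restored).__name__, same)
```

A pickled model is generally best loaded with the same library version that created it.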


### D. Write your conclusion on the results.

Based on the analysis performed, we trained and evaluated several models: K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), XGBoost, Logistic Regression, and Decision Tree. Each model was evaluated on accuracy, precision, recall, and F1-score, and XGBoost was found to be the best suited for this dataset.