Semiconductor Data Classification

1. Import and Explore the Data
Let's start by importing the necessary libraries and loading the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Load the dataset
df = pd.read_csv('sensor-data.csv')

# Explore the data
print(df.head())
print(df.info())
print(df.describe())


2. Data Cleansing
We'll handle missing values, drop any irrelevant columns, and ensure that the data is clean for further analysis.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Drop columns with excessive missing values (if any) or apply imputation
# For example, drop a column:
# df.drop(columns=['column_name'], inplace=True)

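# For this example, we'll fill missing values in numeric columns with each column's median
# (numeric_only=True avoids a TypeError when non-numeric columns are present)
df.fillna(df.median(numeric_only=True), inplace=True)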

# Check for any non-numeric columns that might need conversion
print(df.select_dtypes(include=['object']).columns)
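
The drop-or-impute choice described in the comments above can be sketched as a small helper. This is a minimal sketch on a toy frame: the helper name `clean_sensor_frame`, the 50% missing-value threshold, and the toy column names are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd


def clean_sensor_frame(frame, max_missing_fraction=0.5):
    """Drop sparse columns, then median-impute the remaining numeric ones.

    Hypothetical helper; the 0.5 threshold is an assumption to tune per dataset.
    """
    # Fraction of missing values per column
    missing_fraction = frame.isnull().mean()
    # Keep only columns at or below the threshold
    kept = frame.loc[:, missing_fraction <= max_missing_fraction].copy()
    # Impute numeric columns with their own median; non-numeric columns are left alone
    numeric_cols = kept.select_dtypes(include="number").columns
    kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].median())
    return kept


# Toy frame standing in for the sensor data
toy = pd.DataFrame({
    "sensor_a": [1.0, np.nan, 3.0, 5.0],
    "sensor_b": [np.nan, np.nan, np.nan, 2.0],  # 75% missing, so it is dropped
    "label": ["pass", "fail", "pass", "pass"],
})
cleaned = clean_sensor_frame(toy)
print(cleaned)
```

Dropping very sparse columns before imputing avoids filling a column almost entirely with one repeated median value.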


3. Data Analysis & Visualization
Perform statistical analysis and create visualizations to understand the data better.

In [None]:
# Univariate analysis: distribution of target variable
sns.countplot(x='target_column', data=df)
plt.show()

# Bivariate analysis: correlation heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.show()

# More advanced plots can be added here based on insights.
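
With many sensor columns a heatmap can be hard to read, so one possible follow-up is to list the most strongly correlated feature pairs numerically. The sketch below uses a toy frame; the helper name `top_correlated_pairs`, the 0.9 threshold, and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep the strict upper triangle so each pair appears once and self-pairs are skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly a scaled copy of f1
    "f3": rng.normal(size=200),                            # independent noise
})
result = top_correlated_pairs(toy)
print(result)
```

Pairs that show up here are candidates for dropping one of the two features, since they carry almost the same information.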


4. Data Pre-processing
Separate the features from the target, split into training and testing sets, and then balance and standardize the training data.

In [None]:
# Segregate predictors vs target attributes
X = df.drop('target_column', axis=1)
y = df['target_column']

# Check for target imbalance
print(y.value_counts())

# Train-test split first, stratified so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# If the target is imbalanced, apply SMOTE to the training set only.
# Resampling before the split would let synthetic samples derived from
# test rows end up in the training set and inflate the test scores.
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Standardize the data (fit on the training set, reuse the fit on the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Check for similarity in statistical characteristics
print("Training data stats:")
print(pd.DataFrame(X_train).describe())
print("Test data stats:")
print(pd.DataFrame(X_test).describe())
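
Reading two `describe()` tables side by side gets tedious with many features, so the similarity check above could also be condensed into one table. The following is a sketch on synthetic data; `compare_split_stats` and the column names `s1` to `s3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def compare_split_stats(train, test):
    """Put per-feature means and standard deviations of both splits side by side."""
    summary = pd.DataFrame({
        "train_mean": train.mean(),
        "test_mean": test.mean(),
        "train_std": train.std(),
        "test_std": test.std(),
    })
    # Largest gaps first, so features that differ between splits are easy to spot
    summary["abs_mean_gap"] = (summary["train_mean"] - summary["test_mean"]).abs()
    return summary.sort_values("abs_mean_gap", ascending=False)


rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["s1", "s2", "s3"])
labels = pd.Series(rng.integers(0, 2, size=500))
train, test, y_tr, y_te = train_test_split(
    toy, labels, test_size=0.3, random_state=42, stratify=labels)

summary = compare_split_stats(train, test)
print(summary)
```

Features near the top of the table with a large gap relative to their standard deviation are worth a closer look before trusting the test scores.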


5. Model Training, Testing, and Tuning
Train different models, evaluate them, and fine-tune for the best results.

In [None]:
# Model 1: Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Classifier Report")
print(classification_report(y_test, y_pred_rf))

# Model 2: Support Vector Machine
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print("Support Vector Machine Report")
print(classification_report(y_test, y_pred_svc))

# Model 3: Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Report")
print(classification_report(y_test, y_pred_nb))

# Model comparison
models = ['Random Forest', 'SVM', 'Naive Bayes']
# Mean 5-fold cross-validated accuracy on the training set
cv_accuracies = [cross_val_score(rf, X_train, y_train, cv=5).mean(),
                 cross_val_score(svc, X_train, y_train, cv=5).mean(),
                 cross_val_score(nb, X_train, y_train, cv=5).mean()]

# Accuracy of the already fitted models on the held-out test set
test_accuracies = [rf.score(X_test, y_test),
                   svc.score(X_test, y_test),
                   nb.score(X_test, y_test)]

# Display the accuracies
print("Cross-Validated Train Accuracies:", cv_accuracies)
print("Test Accuracies:", test_accuracies)

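# Choose the best model (highest test accuracy)
fitted_models = [rf, svc, nb]
best_index = int(np.argmax(test_accuracies))
best_model = models[best_index]
print(f"The best model is: {best_model}")

# Save the selected model, not a hardcoded one, for future use
import joblib
joblib.dump(fitted_models[best_index], 'best_model.pkl')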
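
The tuning step named in this section's heading could be carried out with `GridSearchCV`, which is imported in the first cell. The sketch below runs on synthetic data from `make_classification` rather than on the sensor data, and the parameter grid, fold count, and F1 scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the sensor data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Small illustrative grid; a real search would usually cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",  # F1 on the minority class; accuracy can mislead on imbalanced data
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
# search.score uses the same scorer as the search, so this is also an F1 value
print("Test F1 of refit model:", round(search.score(X_te, y_te), 3))
```

By default `GridSearchCV` refits the best parameter combination on the full training data, so `search.best_estimator_` can be used or saved like any other fitted model.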


6. Conclusion and Improvements
Finally, we summarize the findings and suggest possible improvements.

In [None]:
print(f"The {best_model} model gave the best accuracy on the test data among the three classifiers compared. Future work could involve deeper hyperparameter tuning, trying other ensemble methods, or incorporating more advanced feature engineering techniques to further improve accuracy.")
