
## Load and preprocess data




In [31]:
import pandas as pd

df = pd.read_csv('/content/Nepal_updated.csv')
display(df.head())
df.info()  # info() prints its summary directly and returns None, so display() is not needed

| | id | statement | label | source | language | category |
|---|---|---|---|---|---|---|
| 0 | N001 | Nepal government allocates Rs 1647 billion bud... | Real | government_website | English | budget |
| 1 | N002 | Ministry of Education announces revised School... | Real | ministry_document | English | education |
| 2 | N003 | Department of Agriculture launches subsidized ... | Real | department_notice | English | agriculture |
| 3 | N004 | Ministry of Finance releases guidelines for di... | Real | government_website | English | digital_governance |
| 4 | N005 | Nepal Law Commission publishes updated legal f... | Real | legal_document | English | cybersecurity |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         117 non-null    object
 1   statement  117 non-null    object
 2   label      117 non-null    object
 3   source     117 non-null    object
 4   language   117 non-null    object
 5   category   117 non-null    object
dtypes: object(6)
memory usage: 5.6+ KB

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize and create TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(df['statement'])

display(tfidf_features.shape)

(117, 856)
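
To make the 856 feature columns less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings, as described in the scikit-learn documentation: smoothed IDF followed by L2 normalization of each row. The helper name `tfidf_rows` and the two example sentences are illustrative assumptions, not part of the notebook. Tokenization is simplified to lowercase whitespace splitting, whereas the real vectorizer uses a regular expression that also drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])  # L2-normalize the row
    return vocab, rows


# Illustrative sentences only, not rows from the dataset
docs = ["ministry announces budget", "ministry announces new policy"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents (here `budget`) receive a larger weight than terms shared by every document (`ministry`, `announces`).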

## Split data


In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['label'], test_size=0.3, random_state=42)

display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(81, 856)

(36, 856)

(81,)

(36,)
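
One caveat about the two cells above: the vectorizer was fitted on all 117 statements before the split, so vocabulary and IDF statistics from the test rows influence the training features. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that `fit` only sees training text. The sketch below shows the idea on a small generated corpus. The texts, the second label name `"Other"`, and the variable names are placeholders rather than notebook content, and `stratify` is an optional addition that keeps the class ratio similar in both splits.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder corpus for illustration only (not rows from Nepal_updated.csv)
real_texts = [f"Ministry publishes official notice number {i} on the budget" for i in range(8)]
other_texts = [f"Viral post spreads unverified rumor number {i} about subsidies" for i in range(8)]
texts = real_texts + other_texts
labels = ["Real"] * 8 + ["Other"] * 8

# Split the raw text first, keeping class proportions in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])
pipe.fit(train_texts, train_labels)
preds = pipe.predict(test_texts)

print(f"Vocabulary size learned from training text: {len(pipe.named_steps['tfidf'].vocabulary_)}")
print(f"Number of test predictions: {len(preds)}")
```

Because the pipeline accepts raw strings, the same object can later be passed to cross-validation or grid search without refitting the vectorizer by hand.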

## Train classifiers

Train Gradient Boosting, Random Forest, SVM, Decision Tree, and Naive Bayes classifiers on the training data.


In [34]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Instantiate and train Gradient Boosting classifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Instantiate and train Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Instantiate and train SVM classifier
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)

# Instantiate and train Decision Tree classifier
# (no random_state is set, so tie-breaking between equally good splits may vary between runs)
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Instantiate and train Gaussian Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train.toarray(), y_train)

## Evaluate models



In [35]:
from sklearn.metrics import accuracy_score

# Make predictions with Gradient Boosting and calculate accuracy
gb_predictions = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_predictions)
print(f'Gradient Boosting Accuracy: {gb_accuracy}')

# Make predictions with Random Forest and calculate accuracy
rf_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f'Random Forest Accuracy: {rf_accuracy}')

# Make predictions with SVM and calculate accuracy
svm_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f'SVM Accuracy: {svm_accuracy}')

# Make predictions with Decision Tree and calculate accuracy
dt_predictions = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
print(f'Decision Tree Accuracy: {dt_accuracy}')

# Make predictions with Naive Bayes and calculate accuracy
nb_predictions = nb_model.predict(X_test.toarray())
nb_accuracy = accuracy_score(y_test, nb_predictions)
print(f'Naive Bayes Accuracy: {nb_accuracy}')

Gradient Boosting Accuracy: 0.75
Random Forest Accuracy: 0.7777777777777778
SVM Accuracy: 0.75
Decision Tree Accuracy: 0.7222222222222222
Naive Bayes Accuracy: 0.6944444444444444
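
These figures come from a single 70/30 split, so the ranking may change with a different `random_state`. Stratified k-fold cross-validation is one way to average over several splits. The following is a minimal sketch on a generated placeholder corpus. The texts, the `"Other"` label, and the choice of four folds are assumptions for illustration, and Random Forest simply stands in for any of the five classifiers.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus for illustration only
texts = (
    [f"Department releases formal circular number {i} on education" for i in range(8)]
    + [f"Anonymous message shares unconfirmed claim number {i} on taxes" for i in range(8)]
)
labels = ["Real"] * 8 + ["Other"] * 8

# Each fold refits the vectorizer and the classifier on that fold's training part
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the mean together with the spread across folds gives a better sense of whether a gap between two models is stable.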


## Compare models


In [36]:
# Store accuracy scores in a dictionary
accuracy_scores = {
    'Decision Tree': dt_accuracy,
    'Naive Bayes': nb_accuracy,
    'Gradient Boosting': gb_accuracy,
    'Random Forest': rf_accuracy,
    'SVM': svm_accuracy
}

# Print accuracy scores
print("Accuracy Scores:")
for model_name, accuracy in accuracy_scores.items():
    print(f"{model_name}: {accuracy:.4f}")

# Find the model with the highest accuracy
best_model_name = max(accuracy_scores, key=accuracy_scores.get)
highest_accuracy = accuracy_scores[best_model_name]

# Print the best performing model
print(f"\nBest performing model: {best_model_name} with accuracy: {highest_accuracy:.4f}")

Accuracy Scores:
Decision Tree: 0.7222
Naive Bayes: 0.6944
Gradient Boosting: 0.7500
Random Forest: 0.7778
SVM: 0.7500

Best performing model: Random Forest with accuracy: 0.7778


## Summary:

### Data Analysis Key Findings

*   Random Forest achieved the highest accuracy, approximately 0.7778.
*   Gradient Boosting and SVM each reached an accuracy of 0.75.
*   Decision Tree reached approximately 0.7222.
*   Naive Bayes had the lowest accuracy at approximately 0.6944.
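
On a 36-statement test set, the printed accuracy scores correspond to 28, 27, 27, 26, and 25 correct predictions, so neighbouring models differ by a single statement. A rough way to gauge how much weight such gaps can bear is an approximate 95% Wilson score interval for each accuracy. The sketch below is an added illustration. The helper `wilson_interval` is not part of the notebook, and the counts are derived by multiplying each printed accuracy by 36.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Correct-prediction counts implied by the printed accuracies (accuracy * 36)
correct_counts = {
    "Random Forest": 28,
    "Gradient Boosting": 27,
    "SVM": 27,
    "Decision Tree": 26,
    "Naive Bayes": 25,
}

for name, correct in correct_counts.items():
    low, high = wilson_interval(correct, 36)
    print(f"{name}: {correct}/36 correct, interval [{low:.3f}, {high:.3f}]")
```

The intervals overlap heavily, which suggests the ordering of the individual models should be read as tentative.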




## Create ensemble model



In [37]:
from sklearn.ensemble import VotingClassifier

# Create a list of classifiers
estimators = [
    ('Decision Tree', dt_model),
    ('Naive Bayes', nb_model),
    ('Gradient Boosting', gb_model),
    ('Random Forest', rf_model),
    ('SVM', svm_model)
]

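# Instantiate a Voting Classifier with soft voting.
# Note: soft voting calls predict_proba on every estimator, and the SVC trained
# above was created without probability=True, so predicting with this ensemble
# raises an error. In [39] rebuilds the SVC with probability estimates enabled.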
voting_clf = VotingClassifier(estimators=estimators, voting='soft')

# Fit the Voting Classifier to the training data
voting_clf.fit(X_train.toarray(), y_train)

## Evaluate ensemble model


In [38]:
#from sklearn.metrics import accuracy_score

# Make predictions with the Voting Classifier
#voting_predictions = voting_clf.predict(X_test.toarray())

# Calculate the accuracy score
#voting_accuracy = accuracy_score(y_test, voting_predictions)

# Print the accuracy
#print(f'Voting Classifier Accuracy: {voting_accuracy}')

In [39]:
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Modify SVC to enable probability estimates
svm_model = SVC(random_state=42, probability=True)
svm_model.fit(X_train, y_train)

# Create a list of classifiers with the modified SVM model
estimators = [
    ('Decision Tree', dt_model),
    ('Gradient Boosting', gb_model),
    ('Random Forest', rf_model),
    ('Naive Bayes', nb_model),
    ('SVM', svm_model)
]

# Instantiate a Voting Classifier with soft voting
voting_clf = VotingClassifier(estimators=estimators, voting='soft')

# Fit the Voting Classifier to the training data
voting_clf.fit(X_train.toarray(), y_train)

# Make predictions with the Voting Classifier
voting_predictions = voting_clf.predict(X_test.toarray())

# Calculate the accuracy score
voting_accuracy = accuracy_score(y_test, voting_predictions)

# Print the accuracy
print(f'Voting Classifier Accuracy: {voting_accuracy}')

Voting Classifier Accuracy: 0.8333333333333334
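
Soft voting averages the class probabilities returned by each estimator and predicts the class with the highest average, which is why every estimator needs `predict_proba` and why the SVC had to be rebuilt with `probability=True`. The pure-Python sketch below shows the mechanism for a single statement. The function `soft_vote`, the class names, and the probability values are invented for illustration and are not outputs of the models above.

```python
def soft_vote(probas, classes):
    """Average per-model class probabilities and return (winning class, averages)."""
    n_models = len(probas)
    avg = [sum(p[i] for p in probas) / n_models for i in range(len(classes))]
    best = max(range(len(classes)), key=avg.__getitem__)
    return classes[best], avg


classes = ["Other", "Real"]
# Hypothetical predict_proba rows from three models for one statement
probas = [
    [0.60, 0.40],  # model 1 leans "Other"
    [0.55, 0.45],  # model 2 leans "Other"
    [0.10, 0.90],  # model 3 is confident in "Real"
]

label, avg = soft_vote(probas, classes)
print(f"Averaged probabilities: {dict(zip(classes, avg))}")
print(f"Soft-vote prediction: {label}")
```

In this example two of the three models individually favour `Other`, yet the averaged probabilities favour `Real`. A hard (majority) vote would have gone the other way, which illustrates how soft voting lets a confident model outweigh several uncertain ones.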


## Compare ensemble with individual models




In [40]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Store the accuracy of the Voting Classifier
accuracy_scores['Voting Classifier (Reduced)'] = voting_accuracy



# Print accuracy scores
print("Accuracy Scores:")
for model_name, accuracy in accuracy_scores.items():
    print(f"{model_name}: {accuracy:.4f}")

# Find the model with the highest accuracy among all models
best_model_name = max(accuracy_scores, key=accuracy_scores.get)
highest_accuracy = accuracy_scores[best_model_name]

# Print the best performing model
print(f"\nBest performing model (Accuracy): {best_model_name} with accuracy: {highest_accuracy:.4f}")

# Calculate and print precision, recall, and F1-score for the Voting Classifier
voting_precision = precision_score(y_test, voting_predictions, average='weighted')
voting_recall = recall_score(y_test, voting_predictions, average='weighted')
voting_f1 = f1_score(y_test, voting_predictions, average='weighted')

print(f"\nVoting Classifier (Reduced) Metrics:")
print(f"Precision: {voting_precision:.4f}")
print(f"Recall: {voting_recall:.4f}")
print(f"F1-score: {voting_f1:.4f}")

Accuracy Scores:
Decision Tree: 0.7222
Naive Bayes: 0.6944
Gradient Boosting: 0.7500
Random Forest: 0.7778
SVM: 0.7500
Voting Classifier (Reduced): 0.8333

Best performing model (Accuracy): Voting Classifier (Reduced) with accuracy: 0.8333

Voting Classifier (Reduced) Metrics:
Precision: 0.8343
Recall: 0.8333
F1-score: 0.8330
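
With `average='weighted'`, each metric is computed per class and then averaged using the number of true instances of each class as weights. A side effect is that weighted recall always equals accuracy, which is why both print as 0.8333 above. The sketch below reproduces the calculation by hand on a six-item example. The helper `weighted_scores` and the label lists are made up for illustration.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the classes in y_true."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / count
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = count / n  # share of true instances belonging to this class
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return precision, recall, f1


# Made-up labels for illustration
y_true = ["Real", "Real", "Real", "Real", "Other", "Other"]
y_pred = ["Real", "Real", "Real", "Other", "Other", "Real"]

precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}  Accuracy: {accuracy:.4f}")
```

Since weighted recall carries no information beyond accuracy, per-class precision and recall (for example via `classification_report`) are usually more informative when the classes are imbalanced.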


## Summary:

### Data Analysis Key Findings

*   The Voting Classifier ensemble model achieved an accuracy of 0.8333 on the test data.
*   The ensemble model outperformed all individual models: Decision Tree (0.7222), Naive Bayes (0.6944), Gradient Boosting (0.7500), Random Forest (0.7778), and SVM (0.7500).
*   The Voting Classifier was the best performing model among all evaluated models.

