
Import necessary libraries (e.g. pandas, numpy, sklearn).

Load the data (e.g. a CSV file containing social media posts and labels indicating whether the post was written by a stressed individual).

In [None]:
import pandas as pd
import numpy as np
import sklearn as skl
import re

from string import punctuation 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier #classifier
from sklearn.metrics import accuracy_score #evaluation metric

In [None]:
testData = pd.read_csv('test.csv')
trainData = pd.read_csv('train.csv')
sampleData = pd.read_csv('sample.csv')

In [None]:
sampleData.isna().sum()

In [None]:
print(testData.head())

In [None]:
testData.isna().sum()

In [None]:
print(trainData.head())

In [None]:
trainData.isna().sum()

Preprocess the data. This may include cleaning the text (e.g. removing special characters, stemming words), and possibly creating new features (e.g. counting the number of exclamation points in a post).

In [None]:
# A simple function to clean text data

def text_cleaning(text):
    # Replace every non-alphanumeric character (punctuation included) with a space
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    # Remove standalone numbers that are followed by whitespace
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text)
    text = text.lower()  # convert to lowercase

    # Safety net: strip any punctuation that survived the steps above
    text = ''.join([c for c in text if c not in punctuation])

    # Return the cleaned text as a single string
    return text

In [None]:
testData["text"] = testData["text"].apply(text_cleaning)
trainData["text"] = trainData["text"].apply(text_cleaning)
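
The preprocessing step above mentions creating new features such as the number of exclamation points in a post. A minimal sketch of that idea follows, assuming it runs on the raw text before `text_cleaning` strips punctuation. The `raw_posts` frame and the `n_exclaim` and `n_words` column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy posts standing in for the raw, uncleaned "text" column
raw_posts = pd.DataFrame({
    "text": [
        "I can't take this anymore!!!",
        "Had a calm day at work.",
        "Why is this happening?!",
    ]
})

# Count exclamation points per post (must happen before punctuation is removed)
raw_posts["n_exclaim"] = raw_posts["text"].str.count("!")

# Count whitespace-separated words per post as a second simple feature
raw_posts["n_words"] = raw_posts["text"].str.split().str.len()

print(raw_posts[["n_exclaim", "n_words"]])
```

Features like these would need to be combined with the bag-of-words matrix (for example with `scipy.sparse.hstack`) before they reach the classifier.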

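Split the data into a training set and a testing set. First separate the features from the target and convert the text into a bag-of-words matrix; the train/validation split itself happens just before training.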

In [None]:
#split features and target from train data 
X = trainData["text"]
y = trainData.label.values

In [None]:
# Transform text data 
vectorizer = CountVectorizer(lowercase=False)

vectorizer.fit(X)

#transform train data 
X_transformed = vectorizer.transform(X)

#transform test data
test_transformed = vectorizer.transform(testData["text"])
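
To make the vectorization step concrete, here is a small sketch of what `CountVectorizer` does when it is fitted on one set of texts and applied to another. The `toy_train` and `toy_test` sentences are made up for illustration. Words that the fitted vocabulary has never seen are dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora; the real notebook uses trainData["text"] and testData["text"]
toy_train = ["work is stressful", "work is fine"]
toy_test = ["exams are stressful"]

# Same setting as the notebook: text is assumed to be lowercased already
toy_vectorizer = CountVectorizer(lowercase=False)
toy_vectorizer.fit(toy_train)

# The learned vocabulary, in column order
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# Only "stressful" is known, so "exams" and "are" contribute nothing
toy_test_matrix = toy_vectorizer.transform(toy_test)
print(toy_test_matrix.toarray())
```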

Train a machine learning model on the training data. This could be a classifier such as a support vector machine or a random forest.

In [None]:
# split data into train and validate
X_train, X_valid, y_train, y_valid = train_test_split(
    X_transformed,
    y,
    test_size=0.10,
    random_state=42,
    shuffle=True,
    stratify=y,
)

In [None]:
# Create a classifier
stress_classifier = ExtraTreesClassifier() 

In [None]:
# train the stress_classifier 
stress_classifier.fit(X_train,y_train)

In [None]:
# test model performance on valid data 
y_preds = stress_classifier.predict(X_valid)

# evaluate model performance using accuracy_score on the validation data
accuracy_score(y_valid, y_preds) 
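
A single 10% validation split can give a noisy accuracy estimate. The notebook imports `cross_val_score` but never calls it, so the following is a sketch of how it might be used. It assumes a pipeline that refits the vectorizer inside each fold; the `toy_texts` and `toy_labels` data are hypothetical stand-ins for `trainData`.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = stressed, 0 = not stressed
toy_texts = [
    "i feel overwhelmed and anxious about work",
    "cannot sleep because of constant worry",
    "deadlines are crushing me and i am exhausted",
    "panic attacks keep coming back every night",
    "so much pressure i feel like breaking down",
    "anxious and tired of everything lately",
    "had a relaxing walk in the park today",
    "enjoyed a quiet dinner with my family",
    "feeling rested after a good night of sleep",
    "great weekend hiking with friends",
    "reading a nice book with a cup of tea",
    "the weather is lovely and i feel fine",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
cv_pipeline = make_pipeline(
    CountVectorizer(),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
)

# 3-fold stratified cross-validation, scored by accuracy
cv_scores = cross_val_score(cv_pipeline, toy_texts, toy_labels, cv=3, scoring="accuracy")
print("Mean accuracy: {:.2f} (std {:.2f})".format(cv_scores.mean(), cv_scores.std()))
```

Wrapping the vectorizer in the pipeline keeps the vocabulary from being learned on the held-out fold.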

In [None]:
# retrain the stress_classifier on the training split
stress_classifier.fit(X_train,y_train)

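Evaluate the model. This might include calculating the precision, recall, and F1 score. Because the test set has no labels here, the metrics are computed on the validation split, and the test data is only used to generate predictions.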

In [None]:
# create prediction from the test data
test_preds = stress_classifier.predict(test_transformed)
test_preds

In [None]:
from sklearn.metrics import precision_score, recall_score

# Calculate precision
precision = precision_score(y_valid, y_preds)

# Calculate recall
recall = recall_score(y_valid, y_preds)

# Print the results
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
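
The evaluation step also mentions the F1 score, which the cell above does not compute. A minimal sketch using `f1_score` and `classification_report` from scikit-learn follows. The `toy_true` and `toy_pred` lists are hypothetical; in the notebook they would be `y_valid` and `y_preds`, assuming binary labels.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels and predictions standing in for y_valid and y_preds
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]

# F1 is the harmonic mean of precision and recall for the positive class
f1 = f1_score(toy_true, toy_pred)
print("F1: {:.2f}".format(f1))

# Per-class precision, recall, and F1 in one table
print(classification_report(toy_true, toy_pred))
```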


Fine-tune the model by adjusting the hyperparameters (e.g. the number of trees and the maximum tree depth for a tree ensemble, or the learning rate and regularization parameter for other models).
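
One possible way to tune the `ExtraTreesClassifier` is a small grid search. The sketch below is illustrative only: it uses synthetic data from `make_classification` instead of the vectorized posts, and the grid values are arbitrary assumptions and not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical search space over common tree-ensemble hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}

# Try every combination with 3-fold cross-validation
search = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("Best CV accuracy: {:.2f}".format(search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refitted on all the data passed to `fit` with the best parameters found.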

Save the trained model to disk.
For now, I have only created a submission file for the competition.

In [None]:
# create submission file 
testData["label"] = test_preds

In [None]:
# show sample submission rows
testData.head() 

In [None]:
# save submission file
testData = testData[['post_id','label']] 
testData.to_csv("first_submission.csv", index=False) 

Load the trained model and use it to make predictions on new, unseen data.
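
Saving and reloading the model is not implemented in the notebook. A minimal sketch with `joblib` could look like the following. It assumes the vectorizer and classifier are bundled in one pipeline so that both are persisted together; the toy data, the temporary directory, and the `stress_classifier.joblib` file name are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = stressed, 0 = not stressed
toy_texts = [
    "i feel overwhelmed and anxious about work",
    "cannot sleep because of constant worry",
    "had a relaxing walk in the park today",
    "enjoyed a quiet dinner with my family",
]
toy_labels = [1, 1, 0, 0]

# Bundle vectorizer and classifier so one file holds the whole model
model = make_pipeline(
    CountVectorizer(),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
)
model.fit(toy_texts, toy_labels)

# Save to a temporary location (hypothetical path)
model_path = os.path.join(tempfile.mkdtemp(), "stress_classifier.joblib")
joblib.dump(model, model_path)

# Later: load the model and predict on new, unseen posts
loaded_model = joblib.load(model_path)
new_posts = ["deadlines keep me awake", "relaxing walk in the park"]
loaded_preds = loaded_model.predict(new_posts)
print(loaded_preds)
```

A file written by `joblib.dump` should only be loaded with a compatible scikit-learn version and only from a trusted source.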