# Dataset Loading
The first step in any machine learning project is to load the dataset: read the data from a file or database into a data structure that the learning algorithms can work with.

Once the data is loaded, inspect it before modelling: look at the class distribution, the correlation between features, and any missing or invalid values. The dataset is then split into three sets: a training set, a validation set, and a test set.
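
Some of the checks described above can be sketched with pandas once the data is in a DataFrame. The tiny `toy` frame below is made up for illustration; only its `review`/`sentiment` column layout mirrors the data used in this notebook.

```python
import pandas as pd

# Hypothetical miniature dataset with the same column layout (illustration only)
toy = pd.DataFrame({
    "review": ["good film", "bad film", None, "good film", "dull plot"],
    "sentiment": ["positive", "negative", "negative", "positive", "negative"],
})

# Missing values per column
missing = toy.isnull().sum()

# Class balance as fractions of all labels
class_balance = toy["sentiment"].value_counts(normalize=True)

# Exact duplicate reviews (rows with a missing review are ignored)
n_duplicates = int(toy["review"].dropna().duplicated().sum())

# Review length in whitespace-separated tokens
lengths = toy["review"].dropna().str.split().str.len()

print(missing, class_balance, n_duplicates, lengths.describe(), sep="\n\n")
```

A strongly skewed `class_balance` or a large `n_duplicates` would be a reason to stratify the splits or to deduplicate before splitting.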

In [17]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [18]:
df = pd.read_csv("../Datasets/IMDB_Urdu_Reviews/train.csv")

In [19]:
df.head()

Unnamed: 0,review,sentiment
0,میں نے اسے 80 کی دہائی کے وسط میں ایک کیبل گائ...,positive
1,چونکہ میں نے 80 کی دہائی میں انسپکٹر گیجٹ کارٹ...,negative
2,ایک ایسے معاشرے کی حالت کے بارے میں تعجب کرتا ...,positive
3,مفید البرٹ پیون کی طرف سے ایک اور ردی کی ٹوکری...,negative
4,یہ کولمبو ہے جس کی ہدایتکاری اپنے کیریئر کے اب...,positive


In [20]:
df.shape

(40000, 2)

In [21]:
df.describe()

Unnamed: 0,review,sentiment
count,40000,40000
unique,39737,2
top,آج کا شو پسند آیا !!! یہ ایک قسم تھی اور نہ صر...,negative
freq,4,20082


In [22]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [27]:
le = LabelEncoder()

In [28]:
df["labels"] = le.fit_transform(df.sentiment)

In [29]:
df.head()

Unnamed: 0,review,sentiment,labels
0,میں نے اسے 80 کی دہائی کے وسط میں ایک کیبل گائ...,positive,1
1,چونکہ میں نے 80 کی دہائی میں انسپکٹر گیجٹ کارٹ...,negative,0
2,ایک ایسے معاشرے کی حالت کے بارے میں تعجب کرتا ...,positive,1
3,مفید البرٹ پیون کی طرف سے ایک اور ردی کی ٹوکری...,negative,0
4,یہ کولمبو ہے جس کی ہدایتکاری اپنے کیریئر کے اب...,positive,1
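
To see which integer stands for which class, the fitted encoder can be queried directly. This is a minimal sketch on three made-up labels; `le_demo` is a hypothetical name chosen so that it does not overwrite the notebook's `le`.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["positive", "negative", "positive"])

# classes_ is sorted, so index 0 is "negative" and index 1 is "positive"
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))
```

This matches the table above, where `positive` rows carry the label 1 and `negative` rows the label 0.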


In [None]:
import urduhack
from urduhack.preprocessing import remove_punctuation, remove_accents

In [34]:
# Strip punctuation first, then diacritics (accents), from every review
df["review"] = df["review"].apply(remove_punctuation)
df["review"] = df["review"].apply(remove_accents)
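
For readers without urduhack installed, the effect of these two steps can be roughly approximated with the standard library. The helper `strip_marks_and_punctuation` below is a hypothetical stand-in based on Unicode character categories, not urduhack's implementation, and its output may differ from the library's in edge cases.

```python
import unicodedata


def strip_marks_and_punctuation(text: str) -> str:
    """Drop nonspacing marks and replace punctuation with spaces."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":  # nonspacing marks, e.g. zabar, zer, pesh
            continue
        # Every punctuation category starts with "P" (Po, Pd, Ps, ...)
        kept.append(" " if category.startswith("P") else ch)
    # Collapse the whitespace left behind by removed punctuation
    return " ".join("".join(kept).split())


# alif + zabar + be + Urdu full stop
sample = "\u0627\u064e\u0628\u06d4"
print(strip_marks_and_punctuation(sample))
```

The text is deliberately not Unicode-normalized first: decomposing would split letters such as alif with madda into a base letter plus a mark, and the mark would then be stripped.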

In [35]:
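# `urduhack.tokenize_words` does not exist and raises an AttributeError.
# Per the urduhack documentation, the word tokenizer lives in `urduhack.tokenization`
# and needs pretrained model files, fetched once with `urduhack.download()`.
from urduhack.tokenization import word_tokenizer

df["review"] = df["review"].apply(word_tokenizer)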

In [None]:
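# Unexecuted cell: it assumes a `stemmer` module in urduhack; verify that the installed version provides one
df["review"] = df["review"].apply(lambda x: [urduhack.stemmer.stem(w) for w in x])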

In [None]:
df["review"] = df["review"].apply(lambda x: " ".join(x))

# Training, Validation and Test Sets
The training set is used to train the machine learning model. The validation set is used to fine-tune the model by adjusting hyperparameters or other settings. Finally, the test set is used to evaluate the model's performance on unseen data.

<b>Training Set:</b>
The training set typically holds the bulk of the data samples and is used to fit the model's parameters. Training aims to minimize the difference between the model's predictions and the true labels of these samples.

<b>Validation Set:</b>
Besides guiding the choice of hyperparameters, the validation set is used to monitor the model's performance during training. It is typically smaller than the training set and should be representative of the overall data distribution.

<b>Test Set:</b>
The test set must be kept completely separate from the training and validation sets, so that its score is an unbiased estimate of performance on unseen data and reveals whether the model has overfitted. It should be large enough to give a statistically meaningful evaluation.

<b>Iterated K-Fold Validation with Shuffling:</b>
In situations where there is relatively little data available and you need to evaluate your model as precisely as possible, iterated k-fold validation with shuffling can be used. This involves applying k-fold validation multiple times, shuffling the data every time before splitting it k ways. The final score is the average of the scores obtained at each run of k-fold validation. This method can be computationally expensive, since one model is trained and evaluated per fold in every iteration: P iterations of k-fold validation require P × k models.
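
A minimal sketch of iterated k-fold validation, assuming scikit-learn's `RepeatedKFold` as the splitter. The synthetic data and the logistic regression model are placeholders chosen for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

n_folds, n_repeats = 5, 3  # k folds, repeated with a fresh shuffle each time
rkf = RepeatedKFold(n_splits=n_folds, n_repeats=n_repeats, random_state=42)

# One model is trained and scored per fold per repetition
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=rkf)

print(f"{len(scores)} fits, mean accuracy {np.mean(scores):.3f}")
```

With 5 folds and 3 repetitions, 15 models are fitted, which is where the cost of this method comes from.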

## Splitting the dataset into training, validation, and test sets

In [10]:
from sklearn.model_selection import train_test_split

In [12]:
# Hold out 20% of the full dataset as the test set
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the remaining 80% again (80/20), giving 64% training, 16% validation, 20% test overall
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)
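
A variant worth considering is a stratified split, which keeps the class ratio the same in all three sets. The sketch below uses a made-up balanced frame (`toy`) and hypothetical variable names so that it does not overwrite the splits above; only the `stratify` argument differs from the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame with a `labels` column (illustration only)
toy = pd.DataFrame({
    "review": [f"text {i}" for i in range(1000)],
    "labels": [i % 2 for i in range(1000)],
})

# 80/20 split, then 80/20 again on the remainder, preserving class ratios each time
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["labels"])
train_part, val_part = train_test_split(
    train_part, test_size=0.2, random_state=42, stratify=train_part["labels"])

print(len(train_part), len(val_part), len(test_part))
```

For 1000 rows this yields 640 training, 160 validation, and 200 test rows, each with an equal share of both classes.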

## Alternatively, we can use the K-fold cross-validation technique to split the dataset

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import KFold

In [None]:
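# scikit-learn ships a built-in stop-word list only for English, so stop_words='urdu'
# raises a ValueError; pass a custom list of Urdu stop words instead if needed.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["review"])
# The original line was cut off after `df.`; the encoded `labels` column is assumed here
y = df["labels"].to_numpy()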

In [None]:

# Define the number of folds (K)
K = 5

# Initialize the K-fold cross-validator
kfold = KFold(n_splits=K, shuffle=True, random_state=42)

# Split the dataset into K folds and iterate over them
validation_scores = []
for fold, (train_index, val_index) in enumerate(kfold.split(X)):
    # Get the training and validation data for this fold
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    
    # Train an SVM model on the training data and evaluate it on the validation data
    model = SVC(kernel='linear')
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    
    # Record this fold's validation score so the scores can be averaged afterwards
    validation_scores.append(val_score)
    
# Compute the final score as the average of the validation scores
final_score = np.mean(validation_scores)
print(f'Average validation score: {final_score:.4f}')
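
The loop above fits the vectorizer on the whole dataset before splitting, so each fold's vocabulary has already seen the validation documents. One way to avoid this, sketched here under stated assumptions, is to put the vectorizer and the classifier in a single pipeline and let `cross_val_score` refit both on every training fold. The toy corpus is made up, and `LinearSVC` is swapped in for `SVC(kernel='linear')` because it usually scales better to large sparse text data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (illustration only): 10 positive and 10 negative texts
texts = [f"great film number {i}" for i in range(10)] + \
        [f"awful film number {i}" for i in range(10)]
targets = [1] * 10 + [0] * 10

# The vectorizer lives inside the pipeline, so it is refit on each training fold
pipeline = make_pipeline(CountVectorizer(), LinearSVC())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, texts, targets, cv=cv)
print(f"Average validation score: {fold_scores.mean():.4f}")
```

`StratifiedKFold` is used instead of `KFold` so that every fold contains both classes, which matters most on small or imbalanced datasets.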