# Natural Language Processing with Disaster Tweets
The goal of the project is to build a machine learning model that can accurately predict whether
a tweet is about a real disaster or not. The outcome of the project will be a model that can be
used to classify new, unseen tweets, helping disaster relief organizations and news agencies to
quickly identify tweets that are relevant to disasters.

## Import the necessary libraries:
We import pandas for data handling, plus the scikit-learn classes needed to vectorize the text, split the data, and fit a logistic regression classifier.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## Load the data files into Pandas dataframes:
We load the training and test data files into separate pandas dataframes. The training dataframe is used to fit and evaluate the model; the test dataframe is used only at the end, to generate the predictions for the submission file.

In [2]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

### A quick look at our data

Let's look at our data. First, an example of a tweet that is not about a disaster (`target == 0`):

In [3]:
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

And one that is about a disaster (`target == 1`):

In [4]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

## Define the feature matrix and the target variable:

We define the feature matrix (X) and the target variable (y) for the logistic regression model. scikit-learn's CountVectorizer converts each tweet into a vector of token counts (a bag-of-words representation), so X is a sparse matrix with one row per tweet and one column per vocabulary term, and y holds the 0/1 labels.

In [5]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_df['text'])
y = train_df['target']
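
To make the bag-of-words idea concrete, here is a minimal sketch on a two-tweet toy corpus. The tweets and the `toy_` names are made up for illustration and are not part of the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from train.csv.
toy_tweets = [
    "Forest fire near the town",
    "I love the forest",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_tweets)

# Recover the vocabulary in column order. Text is lowercased by default,
# and the default token pattern drops single-character tokens such as "I".
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense = toy_counts.toarray()

print(vocab)
print(dense)
```

Each row of `dense` is one tweet and each column counts how often the corresponding vocabulary term appears in it.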

## Split the data into training and testing sets:
We split the data into a training set and a testing set. With `test_size=0.2`, 80% of the rows are used to train the logistic regression model and the remaining 20% are held out to evaluate it. Fixing `random_state` makes the split reproducible.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
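
Because the vectorizer above was fitted before the split, its vocabulary is learned from every row, including the held-out ones. One possible alternative, sketched here on a made-up toy dataframe, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline, so the vocabulary comes from the training rows only. The dataframe contents, the `stratify` option, and the `max_iter` value are assumptions for illustration, not the notebook's settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for train_df with the same column names.
toy_df = pd.DataFrame({
    "text": [
        "forest fire spreading fast", "earthquake hits the city",
        "flood warning issued tonight", "wildfire evacuation ordered",
        "storm damage reported downtown", "i love fruits",
        "what a lovely day", "my new phone is great",
        "this song is fire", "going to the beach",
    ],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Split the raw text, keeping the class ratio equal in both parts.
text_train, text_val, y_tr, y_val = train_test_split(
    toy_df["text"], toy_df["target"],
    test_size=0.2, random_state=42, stratify=toy_df["target"],
)

# The pipeline fits the vectorizer on the training text only.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(text_train, y_tr)

val_accuracy = pipeline.score(text_val, y_val)
print(f"Held-out accuracy on toy data: {val_accuracy}")
```

With only ten toy rows the accuracy itself is meaningless; the point is the order of operations.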

## Train a Logistic Regression model:
We fit scikit-learn's LogisticRegression class, with its default settings, on the training split.

In [7]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
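
The warning above means the solver stopped at its iteration limit before converging, and the message itself suggests raising `max_iter`. A minimal sketch of that fix on synthetic data follows; the value 1000 and the `make_classification` dataset are assumptions for illustration, and a suitable limit for the tweet data would need to be found by experiment.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, standing in for the tweet features.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    toy_lr = LogisticRegression(max_iter=1000)
    toy_lr.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = int(toy_lr.n_iter_[0])
print(f"Solver iterations used: {iterations_used}")
```

If `iterations_used` is below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.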


## Evaluate the model on the test set:
We evaluate the logistic regression model on the testing set. For a classifier, the `score` method returns the mean accuracy: the fraction of held-out tweets whose predicted label matches the true label.

In [8]:
accuracy = lr.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8095863427445831
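
Accuracy can look good even when the model misses many disaster tweets, and Kaggle competitions of this kind are often scored on F1 rather than accuracy (check the competition's evaluation page to confirm). The sketch below uses made-up labels, not the notebook's predictions, to show how the two metrics can diverge.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels: two real disaster tweets, one of which is missed.
y_true_toy = [1, 0, 0, 0, 0, 0, 0, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_f1 = f1_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
toy_confusion = confusion_matrix(y_true_toy, y_pred_toy)

print(f"Accuracy: {toy_accuracy}")
print(f"F1: {toy_f1}")
print(toy_confusion)
```

The same three calls could be applied to `y_test` and `lr.predict(X_test)` from the cells above.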


## Make predictions on the test data:
We now predict labels for the competition test file (`test_df`), which is separate from the held-out split used for evaluation above. The CountVectorizer's `transform` method (not `fit_transform`) maps the test tweets onto the vocabulary already learned from the training data, and the model's `predict` method returns a 0/1 label for each tweet.

In [9]:
# Use a separate name so the held-out split stored in X_test is not overwritten.
X_submission = vectorizer.transform(test_df['text'])
predictions = lr.predict(X_submission)
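
The difference between `fit_transform` and `transform` matters here: `transform` reuses the fitted vocabulary, so new text produces the same number of columns and words never seen during fitting are simply ignored. A small sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

fitted = CountVectorizer()

# Learn the vocabulary from hypothetical "training" sentences.
train_counts = fitted.fit_transform(["forest fire near town", "love fruits"])

# Reuse that vocabulary for unseen text; "volcano" and "village" are unknown.
new_counts = fitted.transform(["volcano fire near village"])

print(train_counts.shape)
print(new_counts.toarray())
```

Calling `fit_transform` on the new text instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.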

## Create a submission file:
We create a submission file for the Kaggle competition. It is a pandas dataframe with the `id` column from the test data and the predicted `target` values from the logistic regression model, saved to a CSV file without the index column.

In [10]:
submission_df = pd.DataFrame({
    "id": test_df["id"],
    "target": predictions
})
submission_df.to_csv("data/submission.csv", index=False)
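
Before uploading, it can help to read the file back and check its shape. The sketch below does this on a made-up three-row submission written to an in-memory buffer; the ids and labels are hypothetical, and the exact format Kaggle expects should be confirmed against the competition's sample submission.

```python
import io

import pandas as pd

# Hypothetical submission with the same two columns as submission_df.
toy_submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks: expected columns, no missing values, binary labels.
columns_ok = list(reloaded.columns) == ["id", "target"]
no_missing = not reloaded.isnull().values.any()
labels_ok = set(reloaded["target"].unique()) <= {0, 1}

print(columns_ok, no_missing, labels_ok)
```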
