# Classification

## 1: Logistic Regression 

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
sns.set_style("white")

In [None]:
np.random.seed(72)

### Load data

For the first part, we use the _income classification_ dataset from [Kaggle](https://www.kaggle.com/lodetomasi1995/income-classification). This dataset contains information about individuals and whether they earn more than $50K or not. Some of the features in the dataset are:

> 1. `age`: age of the person.
> 2. `workclass`: the sector the person works in, e.g., private sector, state.
> 3. `education`: the last degree the person has received.
> 4. `occupation`: type of job, e.g., services, sales, armed forces.
> 5. `race`
> 6. `sex`
> 7. `native-country`

In [None]:
df = pd.read_csv("data/incom_classification.csv")
df.head()

How many rows and columns does this dataset have?


### Base rate
What is the base rate?
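
As a rough illustration, here "base rate" is taken to mean the accuracy of always predicting the most frequent class (an assumption about the intended definition). The sketch below uses made-up labels rather than the real dataset:

```python
import pandas as pd

# Toy labels standing in for the income column (illustrative values only).
labels = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])

# Relative frequency of each class.
class_freq = labels.value_counts(normalize=True)

# The base rate is the share of the majority class.
majority_class = class_freq.idxmax()
base_rate = class_freq.max()
print(majority_class, base_rate)
```

A classifier is only useful if its test accuracy is clearly above this number.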

### Important! 

__For all the questions below, fix the seed of random generators to 72.__

### Training

Train a logistic regression model on this dataset. Use all of the features in the dataset. For the categorical features, encode `education`, `occupation`, and `native-country` with label encoding and the rest with one-hot encoding. Split the dataset into an 80% training set and a 20% test set.

- What is the training accuracy?

- What is the test accuracy?

- What are the precision and recall for predicting the income class `<=50K`?

In [None]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

#### Numerical columns

In [None]:
# Extract the numerical attributes


#### Categorical columns

Encode the columns `workclass`, `race`, and `sex` using one-hot encoding.

Encode the columns `education`, `occupation`, and `native-country` using label encoding.

Also encode the labels vector `y` to have \[0, 1\] labels instead of \[<=50K, >50K\].

Now concatenate all these features (numerical, label encoded, and one-hot encoded) into a single dataframe. You can use the `pd.concat` function.
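
The encoding steps above can be sketched on a toy frame. The values are made up, only a subset of the columns is shown, and the label column name `income` is an assumption; `pd.get_dummies` is used here as one possible way to one-hot encode:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with a few of the columns named above (values are made up).
toy = pd.DataFrame({
    "age": [25, 40, 33],
    "workclass": ["Private", "State-gov", "Private"],
    "education": ["Bachelors", "HS-grad", "Masters"],
    "income": ["<=50K", ">50K", "<=50K"],  # hypothetical label column name
})

# Numerical part: kept as is.
numerical = toy[["age"]]

# Label encoding: each category becomes an integer code (one encoder per column).
label_encoded = toy[["education"]].apply(
    lambda col: LabelEncoder().fit_transform(col)
)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy[["workclass"]], dtype=int)

# Target: LabelEncoder sorts the classes, so "<=50K" maps to 0 and ">50K" to 1.
y = LabelEncoder().fit_transform(toy["income"])

# Put all feature blocks side by side.
X = pd.concat([numerical, label_encoded, one_hot], axis=1)
print(X.columns.tolist(), y)
```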

#### Train/test splitting
Now split the data into 80% training and 20% test set. Remember to set the random seed to 72.
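
A minimal sketch of the split on synthetic arrays (the real `X` and `y` come from the preprocessing steps). Passing `random_state=72` to `train_test_split` is what makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 gives the 80/20 split; random_state fixes the shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=72
)
print(X_train.shape, X_test.shape)
```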

#### Standardization
Standardize the numerical features to have mean zero and standard deviation equal to 1. You can use sklearn's `StandardScaler` class.
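
One common way to do this, sketched on made-up numbers, is to fit the scaler on the training set only and reuse the fitted statistics on the test set. This is a general convention and not necessarily the only approach accepted here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numerical feature (one column) for the train and test sets.
X_train_num = np.array([[20.0], [30.0], [40.0]])
X_test_num = np.array([[30.0], [50.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train_num)

# Apply the training statistics to the test data (no refitting).
X_test_scaled = scaler.transform(X_test_num)

print(X_train_scaled.mean(), X_train_scaled.std())
```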

#### Training
Finally, train a Logistic Regression model on the processed dataset you just created. Use the following attributes for Logistic Regression.

```
LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)
```

The `LogisticRegressionCV` class lets you train a logistic regression model with cross-validation. It fits a logistic regression model with an L2 regularizer and selects the regularization hyper-parameter by cross-validation. In sklearn this hyper-parameter is `C`, the inverse of the regularization strength, so smaller values mean stronger regularization. The attribute `cv` sets the number of folds used for cross-validation. By default, it searches over a grid of 10 values between $10^{-4}$ and $10^4$, spaced on a logarithmic scale. A well-tuned regularizer usually improves the generalization ability of the model, that is, it tends to improve the test accuracy.
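
To see the cross-validated search in action, here is a small sketch on synthetic data generated with `make_classification` (not the income dataset). The fitted attributes `Cs_` and `C_` expose the candidate grid and the selected value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary classification problem, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=72)

clf = LogisticRegressionCV(solver="lbfgs", cv=5, max_iter=1000, random_state=72)
clf.fit(X, y)

# Cs_ holds the candidate values of C (inverse regularization strength).
print(clf.Cs_)

# C_ holds the value selected by cross-validation.
selected_C = np.ravel(clf.C_)[0]
print(selected_C)
```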

#### Accuracy

Now compute the confusion matrix of your classifier on the test data.
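
A short sketch with hand-written labels and predictions (not real model output) shows how the confusion matrix relates to precision and recall. It assumes the class `<=50K` has been encoded as `0`, so `pos_label=0` is passed explicitly:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written true labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision and recall for class 0 (assumed to be "<=50K").
precision_0 = precision_score(y_true, y_pred, pos_label=0)
recall_0 = recall_score(y_true, y_pred, pos_label=0)
print(precision_0, recall_0)
```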

## 2: Text Analytics

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import spacy

In [None]:
!python -m spacy download en_core_web_sm

#### Load the data

For this part, we use the _Twitter climate change sentiment dataset_ from [Kaggle](https://www.kaggle.com/edqian/twitter-climate-change-sentiment-dataset). The dataset contains tweets related to climate change. Each tweet is labeled as one of the following classes:

- `2` (News): the tweet links to factual news about climate change

- `1` (Pro): the tweet supports the belief in man-made climate change

- `0` (Neutral): the tweet neither supports nor refutes the belief in man-made climate change

- `-1` (Anti): the tweet rejects the belief in man-made climate change


Your task is to predict the sentiment of these tweets using the text analytics techniques you have learned in the lab and a logistic regression model.

In [None]:
df = pd.read_csv("data/twitter_sentiment_data.csv")
df.head()

#### Base rate
What is the base rate for this problem?

#### Processing the tweets

Preprocess the tweets:
- remove the stopwords

- remove the punctuation marks

- lowercase all of the words

- lemmatize all of the words

#### Train/test splitting
Split the dataset into an 80% training set and a 20% test set. Remember to set the random seed to 72.

#### TF-IDF feature vectors

Create the TF-IDF feature vectors for the processed tweets. These vectors are the features you will use to train a classifier.

#### Training

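Now train a logistic regression classifier on the TF-IDF vectors. Use the `LogisticRegression` class from sklearn, without tuning the regularizer, with the following attributes (note that with these attributes sklearn still applies its default L2 penalty with `C=1.0`):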

```
LogisticRegression(solver="lbfgs", max_iter=1000, random_state=72)
```

We encourage you to build a pipeline that first vectorizes the input text and then applies the classifier to the TF-IDF vectors. To do this, you can use `Pipeline` from `sklearn.pipeline`.
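
A possible shape for such a pipeline is sketched below on a handful of made-up, already preprocessed texts. The step names `"tfidf"` and `"clf"` and the toy data are illustrative choices, not part of the assignment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up preprocessed tweets and labels, for illustration only.
train_texts = [
    "climate change be real",
    "climate hoax",
    "new report on climate",
    "global warming hoax",
]
train_labels = [1, -1, 2, -1]

# The vectorizer is fit on the training texts only; the classifier then
# sees the resulting TF-IDF vectors.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000, random_state=72)),
])
text_clf.fit(train_texts, train_labels)

# At prediction time the pipeline applies the same vectorizer before classifying.
predictions = text_clf.predict(["climate hoax again"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, its vocabulary and IDF weights are learned from the training texts alone, which keeps test information out of the features.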

#### Accuracy

- What is the test accuracy of the classifier?

Compute the confusion matrix for this classifier.

#### Improving your classifier!

What else could you do to improve the test accuracy of your classifier? Here are some suggestions:

- Use regularized logistic regression and tune the hyper-parameter with cross-validation.

- Apply further text preprocessing, e.g., removing retweet markers of the form `RT @<user>`, removing hashtags, removing duplicate tweets (if any), etc.
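
The extra cleaning steps in the last suggestion could look roughly like the sketch below. The regular expressions, the helper name `clean_tweet`, and the example tweets are all illustrative assumptions; real tweets may need more careful patterns:

```python
import re
import pandas as pd

# Leading retweet marker such as "RT @user1: " (assumed format).
RETWEET = re.compile(r"^RT\s+@\w+:?\s*")
# Hashtags such as "#climate".
HASHTAG = re.compile(r"#\w+")

def clean_tweet(text):
    """Strip a leading retweet marker and all hashtags, then tidy whitespace."""
    text = RETWEET.sub("", text)
    text = HASHTAG.sub("", text)
    return " ".join(text.split())

# Made-up tweets, for illustration only.
tweets = pd.Series([
    "RT @user1: Climate change is real #climate",
    "Climate change is real",
    "No warming here #hoax",
])

# After cleaning, the first two tweets are identical, so one of them is dropped.
cleaned = tweets.apply(clean_tweet).drop_duplicates()
print(cleaned.tolist())
```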