# Private test not that **private** after all 🙈


Any information about the private test dataset is always useful, and even more so in this competition, since [semi-supervised learning](https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/143094) may play a decisive role.

After some Googling, I came across what is (probably) the initial dataset used by the author of this (atypical) Kaggle challenge.

In this notebook, starting from the found dataset, I will attempt to create the **private test dataset** and will propose some ideas on how we can use this data to enhance our models.

**Clean data**

I believe that this is quite an important discovery, as the found dataset **has not been processed**. In the Exploratory Data Analysis part, we will see that salient information such as hashtags and uncensored words is present in that dataset. Also, and maybe even more importantly, all 13 original sentiments are still present (empty, sadness, enthusiasm, neutral, worry, surprise, love, fun, hate, happiness, boredom, relief, anger).

**💡Ideas**

One of the main advantages of this dataset is that it lets us find a subset of the training data that better matches the private dataset, and try to *overfit* a model to that data.

Also, as we will see later, the extra information such as hashtags, tweet authors and sentiment can be used during pre- and post-processing, as it plays an important role in this challenge.

**Data leakage**

As some of you correctly pointed out, we cannot say that this is _data leakage_, as the original dataset is mentioned in the challenge description: _The dataset is titled Sentiment Analysis: Emotion in Text tweets._ Nonetheless, the link (https://www.figure-eight.com/data-for-everyone/) it refers to is broken.

The dataset is also available here on Kaggle: https://www.kaggle.com/icw123/emotion

### Import ftfy and other libraries

In this notebook I will use `ftfy`. I discovered this Python package only a few days ago and I would say it's exceptional! If you, like me, did not know it, you can check it out here: [@LuminosoInsight/python-ftfy](https://github.com/LuminosoInsight/python-ftfy).

Given any string, such as `Kaggle is a cool placee &lt;3`, `ftfy.fix_text()` almost magically returns:

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

tqdm.pandas()

import ftfy
ftfy.fix_text('Kaggle is a cool placee &lt;3')

# Original dataset and EDA

Disclaimer: even though I call this dataframe `original`, I cannot guarantee that it is exactly the one used by the challenge's authors. For consistency, I will use this name for the rest of the notebook and I will try to show that we can extract a subset of "private test data", but clearly I cannot guarantee anything.

The dataframe can be found at the [following GitHub link](https://raw.githubusercontent.com/Galanopoulog/DATA607-Project-4/master/TextEmotion.csv). If you search for some of the `train_df` and `test_df` tweets, you will quickly find a match.

Also, this [GitHub README](https://github.com/sarnthil/unify-emotion-datasets/tree/master/datasets) provides some additional information regarding the dataset. As we already know, it was released by an AI company called _Figure Eight_ (formerly CrowdFlower). The same README has an "official" download link from the company's website (http://www.crowdflower.com/wp-content/uploads/2016/07/text_emotion.csv), but the link is broken (for your reference: _Figure Eight_ was recently bought by another company and some links were lost).

In [None]:
original_df = pd.read_csv("https://raw.githubusercontent.com/Galanopoulog/DATA607-Project-4/master/TextEmotion.csv")
original_df.head()

### Single tweet

Let's pick a random tweet from the TSE `train_df`: "sooo sad i will miss you here in san diego!!!" (with `textID` 549e992a42).

As we can see, the `original_df` contains this tweet:

In [None]:
tweet = "sooo sad i will miss you here in san diego!!!"
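# regex=False: match the tweet literally, so punctuation is never read as a regex pattern
original_df[original_df['content'].str.lower().str.contains(tweet, regex=False)]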

Even more interesting, we know who the tweet `author` is (hidalgoal), we have access to the mention (@danecook), and we have granular information regarding the `sentiment` (sadness).
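As a minimal sketch of how the mentions and hashtags could be pulled out of the raw `content`, assuming that a handle or hashtag is simply `@` or `#` followed by word characters (the patterns, the function name and the example tweet are all hypothetical):

```python
import re

# Assumed patterns: "@" or "#" followed by word characters.
# Real Twitter rules are stricter; email addresses would also match MENTION_RE.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_entities(content):
    """Return the mentions and hashtags found in a raw tweet."""
    return {
        "mentions": MENTION_RE.findall(content),
        "hashtags": HASHTAG_RE.findall(content),
    }

# Made-up tweet, for illustration only.
example = "@some_user see you at #kaggle and #nlp days!"
print(extract_entities(example))
# {'mentions': ['some_user'], 'hashtags': ['kaggle', 'nlp']}
```

On the real dataframe, something like `original_df['content'].apply(extract_entities)` should give one dictionary per tweet.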

### Sentiment feature

In this dataset, there are **13** sentiments: 

In [None]:
len(original_df['sentiment'].unique())

In [None]:
list(original_df['sentiment'].unique())

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,5)

%config InlineBackend.figure_format='retina'

title = "Sentiment distribution in original_df"
original_df.groupby('sentiment')['content'].count().plot.bar(color='orange', title=title);

`happiness`, `neutral`, `sadness` and `worry` are the most frequent sentiments. We can assume that `happiness` was mapped to `positive`, whereas `sadness` and `worry` were mapped to `negative`.
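The assumed mapping can be written down as a small lookup table. This is only a sketch of the guess above, not a documented mapping: the `neutral` entry is an additional assumption of mine, and every other label is deliberately left unmapped.

```python
# Assumed mapping from fine-grained to coarse sentiment.
# Only the labels discussed in the text are filled in.
ASSUMED_MAPPING = {
    "happiness": "positive",
    "sadness": "negative",
    "worry": "negative",
    "neutral": "neutral",  # extra assumption, not stated in the text
}

def to_coarse_sentiment(fine_label):
    """Map a fine-grained label to a coarse one, or None when no guess is made."""
    return ASSUMED_MAPPING.get(fine_label)

labels = ["happiness", "worry", "boredom", "neutral"]
print([to_coarse_sentiment(label) for label in labels])
# ['positive', 'negative', None, 'neutral']
```

With pandas, `original_df['sentiment'].map(ASSUMED_MAPPING)` applies the same table and leaves unmapped labels as `NaN`.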

**Key takeaways**

We can see that the sentiments are far from evenly distributed. We may, for instance, discover that our model has difficulty finding the _support phrase_ for less common sentiments, such as `boredom` or `relief`. In this case, we may want to do some pre-processing.

Also, in the official `train_df` the Jaccard score for the `neutral` sentiment is about 0.97. In the next versions, I will try to map each tweet to the more granular sentiment and will analyze the Jaccard score for each sentiment.
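For reference, here is a sketch of the word-level Jaccard score, assuming lowercasing and whitespace tokenization; the handling of two empty strings is my own assumption, and the short span in the example is purely illustrative:

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        # Assumption: two empty strings count as a perfect match.
        return 1.0
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

text = "sooo sad i will miss you here in san diego!!!"
print(jaccard(text, text))        # 1.0 (identical strings)
print(jaccard(text, "sooo sad"))  # 0.2 (2 shared words out of 10 distinct words)
```

Predicting the whole tweet always scores 1.0 when the support phrase is the entire text, and it scores poorly when the support phrase is short.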

### Tweet authors

Among the 40k tweets, there are 33871 distinct authors. This fact may also be exploited to increase the score. Note also that since the data have been randomly split into train/public test/private test, it may happen that tweets from the same author **are present in both the train and test datasets**.

In [None]:
len(original_df['author'].unique())

_MissxMarisa_ is the most active user with 23 tweets, followed by _ChineseLearn_, _erkagarcia_ and _MiDesfileNegro_.

In [None]:
original_df['author'].value_counts()

# Construction of private test set

The `original_df` has exactly 40 thousand tweets. By simply subtracting the sizes of the TSE train and test datasets, we can estimate the size of the private test dataset.

In [None]:
TSE_DATA = "/kaggle/input/tweet-sentiment-extraction/"

train_df = pd.read_csv(TSE_DATA + "train.csv").dropna().reset_index(drop=True)
test_df = pd.read_csv(TSE_DATA + "test.csv")

In [None]:
size_private_df = original_df.shape[0] - train_df.shape[0] - test_df.shape[0]
size_private_df

The TSE leaderboard page states that the **leaderboard [..] approximately 30% of the test data. The final results will be based on the other 70%**. A simple calculation gives an expected private test size of 8246, about 700 fewer than our estimate. This is probably because some of the tweets were removed during the update announced in [Leaderboard update coming!](https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/142073).

In [None]:
test_df.shape[0] / 30 * 70
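The same arithmetic as a small helper, as a sketch: it assumes the public file is exactly the stated share of the full test data. The function name is hypothetical, and the 3534 rows used below are simply the public size implied by the 8246 figure (8246 × 30 / 70), not a number read from the file.

```python
def expected_private_size(public_rows, public_fraction=0.30):
    """Estimate the private test size from the public size and the public share."""
    private_fraction = 1.0 - public_fraction
    return round(public_rows * private_fraction / public_fraction)

public_rows = 3534  # assumed: the value implied by 8246 * 30 / 70
print(expected_private_size(public_rows))  # 8246
```

Since the 30% is only approximate, the result is best read as an order of magnitude rather than an exact count.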

To construct the private test set I tried many different approaches and algorithms. In the end, I picked a very simple solution that nonetheless gives good results.

We iterate over each tweet in `original_df` and check whether it is present in `train_df` or `test_df`. Since the `train_df`/`test_df` tweets have been cleaned further, we check whether the train/test tweet is contained in the original tweet, and not the other way around.

As you can see, the resulting private dataset contains about 8000 items, a good indication that we are on the right track.

In [None]:
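# Lowercased tweets from the public train and test sets.
train_test_tweets = list(train_df['text'].str.lower()) + list(test_df['text'].str.lower())

def tweet_in_private(content):
    """Return True if no train/test tweet is a substring of `content`."""
    for tweet in train_test_tweets:
        if tweet in content:
            return False
    return True

# Lowercase the original tweets so that the comparison is case-insensitive.
original_df['content'] = original_df['content'].str.lower()
# Brute force: every original tweet is compared with every train/test tweet, so this is slow.
original_df['in_private'] = original_df['content'].progress_apply(tweet_in_private)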

In [None]:
original_df['in_private'].value_counts()
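A toy version of the containment check, to show why the direction matters: the cleaned tweet is a substring of the raw one, not the reverse. This is a sketch under my own assumptions. The `normalize` step (whitespace collapsing) and the skipping of empty strings are additions that the cell above does not perform, and the tweets are made up.

```python
import re

def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed cleaning step)."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_private(original, known_tweets):
    """True when no known (train/test) tweet is a substring of `original`."""
    content = normalize(original)
    # Skip empty strings: "" is a substring of every string.
    return not any(t and t in content for t in map(normalize, known_tweets))

# Made-up tweets, for illustration only.
known = ["  what a great   day", ""]
originals = [
    "@friend What a great day #sunny",  # raw version of a known tweet
    "a completely different tweet",     # not in train/test
]
print([is_private(o, known) for o in originals])
# [False, True]
```

Very short train/test tweets are a weak point of this approach: a one-word tweet is a substring of many unrelated originals and would wrongly remove them from the private set.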

# Private dataset and EDA

Let's analyze our findings.

In [None]:
private_df = original_df[original_df['in_private']]
private_df.head()

In [None]:
private_df.shape

In [None]:
title = "Sentiment distribution in private_df"
private_df.groupby('sentiment')['content'].count().plot.bar(color='orange', title=title);

It seems that the `sentiment` distribution roughly follows that of the entire dataset. Nice.
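To put a number on "roughly follows", one option is to compare the relative frequencies of the two sets. The sketch below uses the total variation distance, which is my own choice and not something the notebook computes, on made-up labels:

```python
from collections import Counter

def sentiment_shares(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def total_variation(p, q):
    """Half the sum of absolute differences between two share dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Made-up labels, for illustration only.
all_labels = ["neutral", "worry", "happiness", "sadness"] * 5
subset_labels = ["neutral", "worry", "happiness", "sadness", "neutral"]
distance = total_variation(sentiment_shares(all_labels), sentiment_shares(subset_labels))
print(round(distance, 3))
```

A value of 0 means identical distributions and 1 means no overlap at all. On the real data, the inputs would be `original_df['sentiment']` and `private_df['sentiment']`.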

In [None]:
private_df['author'].value_counts()

We save `private_df` to a CSV file for further analysis! Now it's up to you ;)

In [None]:
private_df.to_csv("test_private_df.csv", index=False)

# Conclusions and next steps

There is much more that can be done, but some groundwork has been laid. In the next days, I will investigate the creation of the dataset further, for instance by trying to map all tweets and to build a new model with the new features.

Please let me know your opinions and share your advice on how to improve this notebook. I put a lot of effort into it and I hope you found it useful.

Thank you for reading; I hope you learned something!! 🤗