# CAPSTONE PROJECT - CHATBOT

In [1]:
# !pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [2]:
# !pip install scikit-learn

In [3]:
# Check if GPU is available
from tensorflow.python.client import device_lib

def get_gpu_details():
    devices = device_lib.list_local_devices()
    for device in devices:
        if device.device_type == 'GPU':
            print(f"Device Name: {device.name}")
            print(f"Memory Limit: {device.memory_limit} bytes")
            print(f"Description: {device.physical_device_desc}")

get_gpu_details()


Device Name: /device:GPU:0
Memory Limit: 4158652416 bytes
Description: device: 0, name: NVIDIA GeForce GTX 1660 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5


## DATA
### Data sources
https://convokit.cornell.edu/documentation/movie.html <br>
https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html 

### Install ConvoKit

In [4]:
# !pip install convokit

### Load data from source and save to 'data' folder

In [5]:
# from convokit import Corpus, download
# import os

# # Directory where to save the corpus
# data_dir = os.path.join(os.getcwd(), 'data')

# # Ensure the directory exists
# if not os.path.exists(data_dir):
#     os.makedirs(data_dir)

# # Downloading and saving the corpus
# corpus = Corpus(filename=download("movie-corpus", data_dir=data_dir))

# # Saving the corpus to the 'data' folder
# corpus_path = os.path.join(data_dir, "movie_corpus")
# corpus.dump(corpus_path)

Downloading movie-corpus to C:\Users\tomui\Desktop\capstone_project\data\movie-corpus  
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done

### Load data from 'data' folder

In [6]:
from convokit import Corpus
import os

# Directory where to load the corpus
data_dir = os.path.join(os.getcwd(), 'data')

# Load the corpus from the specified folder
loaded_corpus = Corpus(filename=os.path.join(data_dir, "movie_corpus"))
loaded_corpus.print_summary_stats()

Number of Speakers: 9035
Number of Utterances: 304713
Number of Conversations: 83097


In [7]:
type(loaded_corpus)

convokit.model.corpus.Corpus

### Data Structure and Organization
```plaintext
data/
└── movie_corpus/
    ├── conversations.json
    ├── corpus.json
    ├── index.json
    ├── speakers.json
    └── utterances.jsonl
```

Description [here](https://convokit.cornell.edu/documentation/movie.html).


### Choice for exploration
Which files I need from the ConvoKit corpus depends on the functionality I want the chatbot to have. I will most likely need `utterances.jsonl`, because it contains the dialogue data. This is what I will use to train the chatbot to understand and generate human-like responses.

Description from source:  
> "Utterance-level information <br>
> For each utterance, we provide:
> - id: index of the utterance
> - speaker: the speaker who authored the utterance
> - conversation_id: id of the first utterance in the conversation this utterance belongs to
> - reply_to: id of the utterance to which this utterance replies to (None if the utterance is not a reply)
> - timestamp: time of the utterance
> - text: textual content of the utterance
> 
> Metadata for utterances include:
> - movie_idx: index of the movie from which this utterance occurs
> - parsed: parsed version of the utterance text, represented as a SpaCy Doc"


### Understanding data from `utterances.jsonl`

In [8]:
import json
from pprint import pprint as pp

# Initialize a list to hold all the utterances
utterances = []

# Open the file and read line by line
with open(os.path.join(data_dir, 'movie_corpus', 'utterances.jsonl'), 'r') as file:
    
    for line in file:
        utterance = json.loads(line)
        utterances.append(utterance)

In [9]:
print(type(utterances))

<class 'list'>


In [10]:
print(f'There are a total of {len(utterances)} lines\n')
pp(utterances[: 3])

There are a total of 304713 lines

[{'conversation_id': 'L1044',
  'id': 'L1045',
  'meta': {'movie_id': 'm0',
           'parsed': [{'rt': 1,
                       'toks': [{'dep': 'nsubj',
                                 'dn': [],
                                 'tag': 'PRP',
                                 'tok': 'They',
                                 'up': 1},
                                {'dep': 'ROOT',
                                 'dn': [0, 2, 3],
                                 'tag': 'VBP',
                                 'tok': 'do'},
                                {'dep': 'neg',
                                 'dn': [],
                                 'tag': 'RB',
                                 'tok': 'not',
                                 'up': 1},
                                {'dep': 'punct',
                                 'dn': [],
                                 'tag': '.',
                                 'tok': '!',
                           

### Understanding data from other dataset json files

In [11]:
# Load the data
# with open(os.path.join(data_dir, 'movie_corpus', 'conversations.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'corpus.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'index.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'speakers.json'), 'r') as file:
#     conversations = json.load(file)

# print(type(conversations))

# print(f'There are a total of {len(conversations)} keys in the dictionary\n')
# first_three_items = list(conversations.items())[:3]
# pp(first_three_items)

## Decision regarding data

For this chatbot I decided to use only the data in `utterances.jsonl`, so that the chatbot can manage and understand multi-turn conversations. The essential fields are `'text'` for generating responses, `'conversation_id'` for tracking the flow of conversations, and `'reply_to'` for understanding response sequences within a dialogue. The chatbot will not initially use complex NLP features such as the parsed linguistic data, but I will still collect the `'parsed'` and `'toks'` information so that these features can be integrated later. Whether to use this pre-parsed data directly, regenerate similar data, or compare the two will be decided as the project evolves.
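
As a rough illustration of this decision, the sketch below reads JSONL lines and keeps only the core fields. The two sample records and the helper name `load_core_fields` are hypothetical; the hyphenated `reply-to` key follows the lookup used later in this notebook.

```python
import io
import json

# Hypothetical two-line sample in the same shape as utterances.jsonl
sample_jsonl = io.StringIO(
    '{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", '
    '"speaker": "u0", "reply-to": "L1044", "meta": {"movie_id": "m0", "parsed": []}}\n'
    '{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", '
    '"speaker": "u2", "reply-to": null, "meta": {"movie_id": "m0", "parsed": []}}\n'
)

def load_core_fields(lines):
    """Keep only the fields needed for multi-turn modelling."""
    records = []
    for line in lines:
        utt = json.loads(line)
        records.append({
            "id": utt["id"],
            "conversation_id": utt["conversation_id"],
            "reply_to": utt.get("reply-to"),  # hyphenated key, None if absent
            "text": utt["text"],
        })
    return records

core = load_core_fields(sample_jsonl)
print(core[0])
```

Dropping the `parsed` data at load time would keep the resulting table narrow; this notebook keeps it instead so the choice can be revisited later.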

## Converting utterances data to DataFrame
Pandas provides a powerful, easy-to-use interface for data manipulation, filtering, transformation, and analysis. It integrates with the rest of the Python ecosystem for data analysis, machine learning (e.g., scikit-learn, TensorFlow), and visualization (e.g., Matplotlib, Seaborn), and it is fast for datasets that fit comfortably in memory.

In [12]:
import numpy as np
import pandas as pd

# Flatten the data
def flatten_data(data):
    flattened_data = []
    for entry in data:
        flat_entry = {
            'id': entry['id'],
            'conversation_id': entry['conversation_id'],
            'text': entry['text'],
            'speaker': entry['speaker'],
            'reply_to': entry.get('reply-to'),
            'timestamp': entry['timestamp'],
            'movie_id': entry['meta']['movie_id'],
        }
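        # Handle nested parsed data (one entry per sentence).
        # Note: idx restarts at 0 for every sentence, so in multi-sentence
        # utterances the tok_* values of later sentences overwrite earlier ones.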
        for parsed in entry['meta']['parsed']:
            for idx, tok in enumerate(parsed['toks']):
                flat_entry[f'tok_{idx}_token'] = tok['tok']
                flat_entry[f'tok_{idx}_tag'] = tok['tag']
                flat_entry[f'tok_{idx}_dep'] = tok['dep']
                # Add other fields from tokens as needed
        flattened_data.append(flat_entry)
    return flattened_data

# Convert to DataFrame
flattened_data = flatten_data(utterances)
df = pd.DataFrame(flattened_data)

In [13]:
# Show DataFrame to check structure
df.head()

Unnamed: 0,id,conversation_id,text,speaker,reply_to,timestamp,movie_id,tok_0_token,tok_0_tag,tok_0_dep,...,tok_121_dep,tok_122_token,tok_122_tag,tok_122_dep,tok_123_token,tok_123_tag,tok_123_dep,tok_124_token,tok_124_tag,tok_124_dep
0,L1045,L1044,They do not!,u0,L1044,,m0,They,PRP,nsubj,...,,,,,,,,,,
1,L1044,L1044,They do to!,u2,,,m0,They,PRP,nsubj,...,,,,,,,,,,
2,L985,L984,I hope so.,u0,L984,,m0,I,PRP,nsubj,...,,,,,,,,,,
3,L984,L984,She okay?,u2,,,m0,She,PRP,nsubj,...,,,,,,,,,,
4,L925,L924,Let's go.,u0,L924,,m0,Let,VB,ROOT,...,,,,,,,,,,


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Columns: 382 entries, id to tok_124_dep
dtypes: object(382)
memory usage: 888.1+ MB


In [15]:
# Temporarily adjust display settings to show all columns
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.isnull().sum())
#print(df.isnull().sum())

id                      0
conversation_id         0
text                    0
speaker                 0
reply_to            83097
timestamp          304713
movie_id                0
tok_0_token           267
tok_0_tag             267
tok_0_dep             267
tok_1_token           625
tok_1_tag             625
tok_1_dep             625
tok_2_token         21729
tok_2_tag           21729
tok_2_dep           21729
tok_3_token         38515
tok_3_tag           38515
tok_3_dep           38515
tok_4_token         63316
tok_4_tag           63316
tok_4_dep           63316
tok_5_token         92404
tok_5_tag           92404
tok_5_dep           92404
tok_6_token        121962
tok_6_tag          121962
tok_6_dep          121962
tok_7_token        149962
tok_7_tag          149962
tok_7_dep          149962
tok_8_token        175281
tok_8_tag          175281
tok_8_dep          175281
tok_9_token        196627
tok_9_tag          196627
tok_9_dep          196627
tok_10_token       214433
tok_10_tag  

## Saving the DataFrame

In [16]:
# !pip install pyarrow

In [17]:
# !pip install fastparquet

In [18]:
# Saving the DataFrame
file_path_parquet = os.path.join(data_dir, 'utterances.parquet')
df.to_parquet(file_path_parquet)

`utterances.jsonl` - 351 404 KB, `utterances.parquet` - 28 409 KB

## Loading the DataFrame

In [19]:
# Loading the DataFrame
file_path_parquet = os.path.join(data_dir, 'utterances.parquet')
df_loaded_parquet = pd.read_parquet(file_path_parquet)

df_loaded_parquet.head(20)

Unnamed: 0,id,conversation_id,text,speaker,reply_to,timestamp,movie_id,tok_0_token,tok_0_tag,tok_0_dep,...,tok_121_dep,tok_122_token,tok_122_tag,tok_122_dep,tok_123_token,tok_123_tag,tok_123_dep,tok_124_token,tok_124_tag,tok_124_dep
0,L1045,L1044,They do not!,u0,L1044,,m0,They,PRP,nsubj,...,,,,,,,,,,
1,L1044,L1044,They do to!,u2,,,m0,They,PRP,nsubj,...,,,,,,,,,,
2,L985,L984,I hope so.,u0,L984,,m0,I,PRP,nsubj,...,,,,,,,,,,
3,L984,L984,She okay?,u2,,,m0,She,PRP,nsubj,...,,,,,,,,,,
4,L925,L924,Let's go.,u0,L924,,m0,Let,VB,ROOT,...,,,,,,,,,,
5,L924,L924,Wow,u2,,,m0,Wow,UH,ROOT,...,,,,,,,,,,
6,L872,L870,Okay -- you're gonna need to learn how to lie.,u0,L871,,m0,Okay,UH,intj,...,,,,,,,,,,
7,L871,L870,No,u2,L870,,m0,No,UH,ROOT,...,,,,,,,,,,
8,L870,L870,I'm kidding. You know how sometimes you just ...,u0,,,m0,And,CC,cc,...,,,,,,,,,,
9,L869,L866,Like my fear of wearing pastels?,u0,L868,,m0,Like,IN,ROOT,...,,,,,,,,,,


## Data cleaning

### Leaving only necessary data for initial stage of the project
`id` and `conversation_id` for tracking the flow of conversations, `reply_to` for understanding the sequence within a dialogue, and of course the conversation `text`.

In [20]:
conversations = df_loaded_parquet[['text', 'id', 'conversation_id', 'reply_to']]
conversations.head(30)

Unnamed: 0,text,id,conversation_id,reply_to
0,They do not!,L1045,L1044,L1044
1,They do to!,L1044,L1044,
2,I hope so.,L985,L984,L984
3,She okay?,L984,L984,
4,Let's go.,L925,L924,L924
5,Wow,L924,L924,
6,Okay -- you're gonna need to learn how to lie.,L872,L870,L871
7,No,L871,L870,L870
8,I'm kidding. You know how sometimes you just ...,L870,L870,
9,Like my fear of wearing pastels?,L869,L866,L868


In [21]:
# Checking whether the count of None in 'reply_to' matches the count of rows where 'id' == 'conversation_id'
print("Total entries where 'id' equals 'conversation_id':", (df['id'] == df['conversation_id']).sum())
print("Total entries where 'reply_to' is None:", df['reply_to'].isnull().sum())

Total entries where 'id' equals 'conversation_id': 83097
Total entries where 'reply_to' is None: 83097


## Conclusion
The dataset looks well structured and ready for further processing: <br>
`conversation_id` and `id`:  <br>
When `conversation_id` and `id` are the same and there is no `reply_to`, the message starts a new conversation. This shows where each conversation begins.  <br>
Counts of None in `reply_to`:  <br>
The count of None in the `reply_to` field matches the number of conversations (83,097). This is consistent with each conversation starting with a message that does not reply to any previous message, i.e. the first message in the thread.  <br>
Data cleanliness:  <br>
The alignment of these counts and the consistent formatting suggest that the dataset is clean and structured: each message is linked to its conversation, and the flow of each conversation is well defined.  <br>
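
Matching totals do not by themselves show that both conditions select the same rows. A minimal sketch of a row-wise check, on a toy frame with illustrative values (the names `toy`, `is_start_by_id` and `is_start_by_reply` are assumptions, not part of the notebook):

```python
import pandas as pd

# Toy frame in the same shape as the `conversations` DataFrame
toy = pd.DataFrame({
    "id": ["L1045", "L1044", "L985", "L984"],
    "conversation_id": ["L1044", "L1044", "L984", "L984"],
    "reply_to": ["L1044", None, "L984", None],
})

is_start_by_id = toy["id"] == toy["conversation_id"]
is_start_by_reply = toy["reply_to"].isnull()

# True only if both conditions flag exactly the same rows
same_rows = bool((is_start_by_id == is_start_by_reply).all())
print("Conditions agree on every row:", same_rows)
```

The same comparison could be run on the full DataFrame to confirm that the two counts of 83,097 refer to the same messages.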

In [22]:
conversations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   text             304713 non-null  object
 1   id               304713 non-null  object
 2   conversation_id  304713 non-null  object
 3   reply_to         221616 non-null  object
dtypes: object(4)
memory usage: 9.3+ MB


In [23]:
conversations.text.describe()

count     304713
unique    265774
top        What?
freq        1684
Name: text, dtype: object

## Analyze text of conversations

In [24]:
# !pip install textblob

In [25]:
from nltk import FreqDist, word_tokenize
from nltk.tokenize import sent_tokenize
from textblob import TextBlob
import nltk

# Ensure that the punkt tokenizer is available
nltk.download('punkt')

# Basic statistics
conversations.loc[:, 'msg_length'] = conversations.loc[:, 'text'].apply(len)
conversations.loc[:, 'word_count'] = conversations.loc[:, 'text'].apply(lambda x: len(word_tokenize(x)))


print("Average message length (characters):", np.mean(conversations['msg_length']))
print("Average message length (words):", np.mean(conversations['word_count']))
print("Min message length (characters):", np.min(conversations['msg_length']))
print("Max message length (characters):", np.max(conversations['msg_length']))
print("Standard deviation (characters):", np.std(conversations['msg_length']))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations.loc[:, 'msg_length'] = conversations.loc[:, 'text'].apply(len)


Average message length (characters): 55.25953930419772
Average message length (words): 13.721094931952361
Min message length (characters): 0
Max message length (characters): 3046
Standard deviation (characters): 64.06661834805733


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations.loc[:, 'word_count'] = conversations.loc[:, 'text'].apply(lambda x: len(word_tokenize(x)))
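
The `SettingWithCopyWarning` messages in this output appear because `conversations` was created by selecting columns from another DataFrame and then had new columns assigned to it. One common way to avoid the warning, sketched here on a toy frame (`df_toy` and `conv` are hypothetical names), is to take an explicit copy of the selection before adding columns.

```python
import pandas as pd

# Toy stand-in for df_loaded_parquet (values are illustrative)
df_toy = pd.DataFrame({
    "text": ["They do not!", "She okay?"],
    "id": ["L1045", "L984"],
    "speaker": ["u0", "u2"],
})

# .copy() makes the selection an independent DataFrame rather than a slice of df_toy
conv = df_toy[["text", "id"]].copy()
conv["msg_length"] = conv["text"].apply(len)

print(conv)
```

With the copy in place, new columns are added to `conv` only and the original frame is left untouched.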


### Word frequency analysis

In [26]:
all_words = ' '.join(conversations['text']).lower()
words = word_tokenize(all_words)
freq_dist = FreqDist(words)
print("Most common words:", freq_dist.most_common(50))

Most common words: [('.', 332912), (',', 170188), ('you', 148400), ('i', 140952), ('?', 110240), ('the', 99132), ('to', 80649), ('a', 70839), ("'s", 66538), ('it', 66076), ("n't", 55224), ('...', 50796), ('do', 47049), ('that', 46582), ('and', 45934), ('of', 39338), ('!', 37866), ('what', 37719), ('in', 34129), ('me', 32203), ('is', 31639), ('we', 29291), ('he', 27408), ('--', 26662), ('this', 24616), ('for', 23415), ('have', 22934), ("'m", 22578), ("'re", 21717), ('know', 21657), ('was', 21407), ('your', 20962), ('my', 20824), ('not', 19883), ('on', 19560), ('no', 19425), ('be', 19414), ('are', 17600), ('but', 17321), ('with', 17249), ('they', 16942), ('just', 15853), ('all', 15392), ('like', 15007), ("'ll", 14613), ('did', 14547), ('there', 14446), ('get', 14152), ('about', 14000), ('so', 13447)]
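
Punctuation tokens such as `.`, `,` and `?` dominate the list above. A minimal sketch of counting word tokens only, assuming a simple regex tokenizer as a stand-in for `word_tokenize` and a few illustrative utterances in place of `conversations['text']`:

```python
import re
from collections import Counter

# Illustrative utterances; in the notebook these would come from conversations['text']
texts = ["They do not!", "I hope so.", "What? What do you mean?"]

# Keep only runs of letters and apostrophes, which drops punctuation tokens
tokens = re.findall(r"[a-z']+", " ".join(texts).lower())
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

Unlike `word_tokenize`, this tokenizer does not split contractions such as `n't`, so counts for those forms would differ.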


### Sentiment analysis

In [27]:
conversations.loc[:, 'sentiment'] = conversations.loc[:, 'text'].apply(lambda x: TextBlob(x).sentiment.polarity)
print("Average sentiment (polarity):", np.mean(conversations['sentiment']))
print("Sentiment distribution:", conversations['sentiment'].describe())

Average sentiment (polarity): 0.04174547982992158
Sentiment distribution: count    304713.000000
mean          0.041745
std           0.246197
min          -1.000000
25%           0.000000
50%           0.000000
75%           0.013889
max           1.000000
Name: sentiment, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations.loc[:, 'sentiment'] = conversations.loc[:, 'text'].apply(lambda x: TextBlob(x).sentiment.polarity)


In [28]:
# Count zero-length messages
zero_length_messages = conversations[conversations['text'].apply(len) == 0]
print("Number of zero-length messages:", len(zero_length_messages))
zero_length_messages.sample(10)

Number of zero-length messages: 267


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment
154381,,L129216,L129216,,0,0,0.0
153316,,L129434,L129427,L129433,0,0,0.0
289507,,L624042,L624042,,0,0,0.0
153885,,L128055,L128044,L128054,0,0,0.0
154115,,L128754,L128752,L128753,0,0,0.0
154231,,L129490,L129485,L129489,0,0,0.0
98568,,L535288,L535287,L535287,0,0,0.0
153454,,L128976,L128976,,0,0,0.0
153595,,L128655,L128650,L128654,0,0,0.0
153343,,L129381,L129381,,0,0,0.0


The zero-length messages fall into two groups. Messages with the same `id` and `conversation_id` and None in `reply_to` likely represent the start of a conversation; removing them could remove the entry point of a conversational thread.
Messages with a different `id` and `conversation_id` and a specific `reply_to` are responses within a conversation; removing them could create gaps in the sequence and make the flow of the conversation hard to follow.
I've decided to keep the zero-length messages for now; besides, there are only 267 of them.

In [29]:
# Calculating long messages
long_messages = conversations[conversations['msg_length'] > 300]
print("Number of long messages:", len(long_messages))


Number of long messages: 3151


In [30]:
# Print sample long messages
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

Reinforced steel core walls.  Buried phone line, completely separate, not connected to the house's main line and never exposed throughout the house's infrastructure or outside the house -- you can call the police; nobody can cut you off. Your own ventilation system, complete with oxygen scrubber, so you've got plenty of fresh air for as long as you like.  And a bank of video monitors --


I decided to keep long messages for now, since I am considering transformer-based NLP models such as BERT or GPT, which are adept at understanding context over longer stretches of text.

## Conversations text preprocessing

### Remove punctuation

In [31]:
import string
conversations.loc[:, 'text'] = conversations.loc[:, 'text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

She starts givin me some bullshit about it aint there Its somewhere else and we can go get it  Im shootin you in the head right then and there Then Im gonna shoot her in the kneecap find out where my godamn money is I go walkin in there and that nigga Winston or anybody else is in there youre the first man shot understand what Im sayin


### Normalize text

In [32]:
import re
import unicodedata

# Convert to lowercase
conversations.loc[:, 'text'] = conversations.loc[:, 'text'].str.lower().str.strip()
# Function to apply the regex and normalization transformations row-wise
def normalize_text(text):
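    # Replace every character outside a-z, 0-9, space and . ' , ! ? with a space
    # (punctuation was already stripped in the previous cell)
    text = re.sub(r"[^a-z0-9.',!? ]", ' ', text)
    # Replace numbers with a special token
    text = re.sub(r'\d+', '<num>', text)
    # Strip accents from any remaining non-ASCII characters.
    # Note: the first regex has already replaced accented letters with spaces,
    # so this step only has an effect if it is moved before that regex.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')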
    return text

# Apply the normalization function to each row in the 'text' column
conversations.loc[:, 'text'] = conversations.loc[:, 'text'].apply(normalize_text)
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

no no please it is not a holy relic  you know we have met already in this very room perhaps you wont remember it you were only six years old  he was giving the most brilliant little concert here as he got off the stool he slipped and fell my sister antoinette helped him up herself and do you know what he did jumped straight into her arms and said will you marry me yes or no
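
In `normalize_text` the accent normalization runs after the regex filter, by which point accented letters have already been replaced by spaces. The sketch below shows one possible ordering that strips accents first. The function name `normalize_accents_first` and the sample string are hypothetical, and this is an alternative under those assumptions rather than the code that produced the outputs in this notebook.

```python
import re
import unicodedata

def normalize_accents_first(text):
    # Decompose accented characters and drop the combining marks (e.g. é becomes e)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower().strip()
    # Replace anything that is not a lowercase letter, digit or space with a space
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    # Replace runs of digits with a placeholder token
    text = re.sub(r"\d+", "<num>", text)
    return text

print(normalize_accents_first("Caf\u00e9 at 10"))  # prints: cafe at <num>
```

With the regex applied first, the same input would lose the accented letter entirely instead of keeping its base character.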


### Remove Stopwords

In [33]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
conversations.loc[:, 'text'] = conversations['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


theres nothing brain dead cunt think theres absolutely way world connect us anything want hang phone call im going send friend mine crack open fucking rib spreader
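
Stopword removal can empty out short utterances completely: the paired data later in this notebook shows "They do to!" reduced to an empty string and "I hope so." reduced to "hope". A minimal sketch of the effect, using a small illustrative stopword subset instead of the full NLTK list:

```python
# Small illustrative subset of English stopwords (NLTK's list is much longer)
stop_words = {"they", "do", "to", "not", "i", "so", "she"}

def remove_stopwords(text):
    """Drop every whitespace-separated word that is in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

for utterance in ["they do to", "i hope so", "lets go"]:
    print(repr(utterance), "->", repr(remove_stopwords(utterance)))
```

For a generative chatbot this may be worth revisiting, since function words carry much of the meaning in short replies.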


## Analyze text of conversations again

In [34]:
# Basic statistics
conversations.loc[:, 'msg_length'] = conversations['text'].apply(len)
conversations.loc[:, 'word_count'] = conversations['text'].apply(lambda x: len(word_tokenize(x)))

# Calculate basic statistics
average_msg_length_chars = np.mean(conversations['msg_length'])
average_msg_length_words = np.mean(conversations['word_count'])
min_msg_length_chars = np.min(conversations['msg_length'])
max_msg_length_chars = np.max(conversations['msg_length'])
std_dev_msg_length_chars = np.std(conversations['msg_length'])

# Calculate 25th and 75th percentiles
percentile_25_chars = conversations['msg_length'].quantile(0.25)
percentile_75_chars = conversations['msg_length'].quantile(0.75)
percentile_25_words = conversations['word_count'].quantile(0.25)
percentile_75_words = conversations['word_count'].quantile(0.75)

# Print statistics
print("Average message length (characters):", average_msg_length_chars)
print("Average message length (words):", average_msg_length_words)
print("Min message length (characters):", min_msg_length_chars)
print("Max message length (characters):", max_msg_length_chars)
print("Standard deviation (characters):", std_dev_msg_length_chars)
print("25th percentile (characters):", percentile_25_chars)
print("75th percentile (characters):", percentile_75_chars)
print("25th percentile (words):", percentile_25_words)
print("75th percentile (words):", percentile_75_words)

Average message length (characters): 31.98917670069869
Average message length (words): 5.393163402939815
Min message length (characters): 0
Max message length (characters): 1836
Standard deviation (characters): 38.791295277648075
25th percentile (characters): 10.0
75th percentile (characters): 40.0
25th percentile (words): 2.0
75th percentile (words): 7.0


### Tokenize text

In [35]:
conversations.loc[:, 'tokens'] = conversations.loc[:, 'text'].apply(word_tokenize)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations.loc[:, 'tokens'] = conversations.loc[:, 'text'].apply(word_tokenize)


In [36]:
conversations[1200: 1210]

Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
1200,czechoslovakia slavs fighting germans russians...,L2897,L2895,L2896,99,12,0.13,"[czechoslovakia, slavs, fighting, germans, rus..."
1201,eastern europe like romania hungary,L2896,L2895,L2895,35,5,0.0,"[eastern, europe, like, romania, hungary]"
1202,maybe ritual thing someone trying send message...,L2895,L2895,,150,21,-0.216667,"[maybe, ritual, thing, someone, trying, send, ..."
1203,look im even sure anything saw outside fire th...,L2893,L2892,L2892,106,17,0.25,"[look, im, even, sure, anything, saw, outside,..."
1204,would call,L2892,L2892,,10,2,0.0,"[would, call]"
1205,says shes suspect,L2891,L2887,L2890,17,3,0.0,"[says, shes, suspect]"
1206,maybe dont care either prettiest suspect ive a...,L2890,L2887,L2889,51,8,0.0,"[maybe, dont, care, either, prettiest, suspect..."
1207,hmmmm,L2889,L2887,L2888,5,1,0.0,[hmmmm]
1208,pretty,L2888,L2887,L2887,6,1,0.25,[pretty]
1209,super said hed seen didnt live,L2887,L2887,,30,6,0.234848,"[super, said, hed, seen, didnt, live]"


### Lemmatize tokens

In [37]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
conversations.loc[:, 'tokens'] = conversations['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
conversations[1200: 1210]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
1200,czechoslovakia slavs fighting germans russians...,L2897,L2895,L2896,99,12,0.13,"[czechoslovakia, slav, fighting, german, russi..."
1201,eastern europe like romania hungary,L2896,L2895,L2895,35,5,0.0,"[eastern, europe, like, romania, hungary]"
1202,maybe ritual thing someone trying send message...,L2895,L2895,,150,21,-0.216667,"[maybe, ritual, thing, someone, trying, send, ..."
1203,look im even sure anything saw outside fire th...,L2893,L2892,L2892,106,17,0.25,"[look, im, even, sure, anything, saw, outside,..."
1204,would call,L2892,L2892,,10,2,0.0,"[would, call]"
1205,says shes suspect,L2891,L2887,L2890,17,3,0.0,"[say, shes, suspect]"
1206,maybe dont care either prettiest suspect ive a...,L2890,L2887,L2889,51,8,0.0,"[maybe, dont, care, either, prettiest, suspect..."
1207,hmmmm,L2889,L2887,L2888,5,1,0.0,[hmmmm]
1208,pretty,L2888,L2887,L2887,6,1,0.25,[pretty]
1209,super said hed seen didnt live,L2887,L2887,,30,6,0.234848,"[super, said, hed, seen, didnt, live]"


### Adding `<start>` and `<end>` tokens to lists of tokens

In [38]:
# Adding <start> and <end> tokens to each list in the 'tokens' column
conversations['tokens'] = conversations['tokens'].apply(lambda x: ['<start>'] + x + ['<end>'])
conversations[1200: 1210]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations['tokens'] = conversations['tokens'].apply(lambda x: ['<start>'] + x + ['<end>'])


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
1200,czechoslovakia slavs fighting germans russians...,L2897,L2895,L2896,99,12,0.13,"[<start>, czechoslovakia, slav, fighting, germ..."
1201,eastern europe like romania hungary,L2896,L2895,L2895,35,5,0.0,"[<start>, eastern, europe, like, romania, hung..."
1202,maybe ritual thing someone trying send message...,L2895,L2895,,150,21,-0.216667,"[<start>, maybe, ritual, thing, someone, tryin..."
1203,look im even sure anything saw outside fire th...,L2893,L2892,L2892,106,17,0.25,"[<start>, look, im, even, sure, anything, saw,..."
1204,would call,L2892,L2892,,10,2,0.0,"[<start>, would, call, <end>]"
1205,says shes suspect,L2891,L2887,L2890,17,3,0.0,"[<start>, say, shes, suspect, <end>]"
1206,maybe dont care either prettiest suspect ive a...,L2890,L2887,L2889,51,8,0.0,"[<start>, maybe, dont, care, either, prettiest..."
1207,hmmmm,L2889,L2887,L2888,5,1,0.0,"[<start>, hmmmm, <end>]"
1208,pretty,L2888,L2887,L2887,6,1,0.25,"[<start>, pretty, <end>]"
1209,super said hed seen didnt live,L2887,L2887,,30,6,0.234848,"[<start>, super, said, hed, seen, didnt, live,..."


## Examining the conversation id structure more closely

In [39]:
# Keep ids L840..L870 and sort them numerically (raw string avoids an invalid-escape warning for \d)
filtered_sorted_conversations = conversations[conversations['id'].apply(lambda x: 840 <= int(x[1:]) <= 870 if x[1:].isdigit() else False)
].sort_values(by='id', key=lambda x: x.str.extract(r'(\d+)', expand=False).astype(int))

filtered_sorted_conversations


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
428,youre amazingly selfassured anyone ever told,L840,L834,L839,44,6,0.6,"[<start>, youre, amazingly, selfassured, anyon..."
427,go prom,L841,L834,L840,7,2,0.0,"[<start>, go, prom, <end>]"
426,request command,L842,L842,,15,2,0.0,"[<start>, request, command, <end>]"
425,know mean,L843,L842,L842,9,2,-0.3125,"[<start>, know, mean, <end>]"
424,,L844,L842,L843,0,0,0.0,"[<start>, <end>]"
423,,L845,L842,L844,0,0,0.0,"[<start>, <end>]"
422,wont go,L846,L842,L845,7,2,0.0,"[<start>, wont, go, <end>]"
421,,L847,L842,L846,0,0,0.0,"[<start>, <end>]"
420,dont want stupid tradition,L848,L842,L847,26,4,-0.8,"[<start>, dont, want, stupid, tradition, <end>]"
419,create little drama start new rumor,L852,L852,,35,6,-0.025568,"[<start>, create, little, drama, start, new, r..."


id: Unique identifier for each message. <br>
conversation_id: Identifier for the conversation to which the message belongs. All messages within the same conversation share this ID. <br>
reply_to: ID of the message to which the current message is a response. If this is None, the message is the start of a conversation thread.
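
The three fields above are enough to rebuild the order of a conversation by following `reply_to` links back to the first message. A minimal sketch using the ids of conversation `L842` from the table above (the helper name `thread_from` is hypothetical):

```python
# Rows taken from the table above: (id, conversation_id, reply_to)
rows = [
    ("L848", "L842", "L847"),
    ("L847", "L842", "L846"),
    ("L846", "L842", "L845"),
    ("L845", "L842", "L844"),
    ("L844", "L842", "L843"),
    ("L843", "L842", "L842"),
    ("L842", "L842", None),
]

# Map each message id to the id it replies to
reply_to = {msg_id: parent for msg_id, _, parent in rows}

def thread_from(last_id):
    """Walk reply_to links back to the conversation start, then reverse."""
    chain = []
    current = last_id
    while current is not None:
        chain.append(current)
        current = reply_to.get(current)
    return chain[::-1]

print(thread_from("L848"))
```

The walk stops at the message whose `reply_to` is None, which is also the message whose `id` equals the `conversation_id`.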

## Pairing messages - input with responses

In [40]:
# Merging the DataFrame with itself to form pairs
pairs = pd.merge(
    conversations, conversations,
    left_on='id',
    right_on='reply_to',
    suffixes=('_input', '_response')
)

In [41]:
pairs.head()

Unnamed: 0,text_input,id_input,conversation_id_input,reply_to_input,msg_length_input,word_count_input,sentiment_input,tokens_input,text_response,id_response,conversation_id_response,reply_to_response,msg_length_response,word_count_response,sentiment_response,tokens_response
0,,L1044,L1044,,0,0,0.0,"[<start>, <end>]",,L1045,L1044,L1044,0,0,0.0,"[<start>, <end>]"
1,okay,L984,L984,,4,1,0.5,"[<start>, okay, <end>]",hope,L985,L984,L984,4,1,0.0,"[<start>, hope, <end>]"
2,wow,L924,L924,,3,1,0.1,"[<start>, wow, <end>]",lets go,L925,L924,L924,7,2,0.0,"[<start>, let, go, <end>]"
3,,L871,L870,L870,0,0,0.0,"[<start>, <end>]",okay youre gonna need learn lie,L872,L870,L871,31,7,0.5,"[<start>, okay, youre, gon, na, need, learn, l..."
4,im kidding know sometimes become persona dont ...,L870,L870,,55,9,0.0,"[<start>, im, kidding, know, sometimes, become...",,L871,L870,L870,0,0,0.0,"[<start>, <end>]"


In [42]:
# # Selecting the needed columns including IDs
# training_data = pairs[['id_input', 'text_with_tokens_input', 'tokens_input', 'sentiment_input', 'id_response', 'text_with_tokens_response', 'tokens_response', 'sentiment_response']]

# # Renaming columns for clarity
# training_data.columns = ['ID_Input', 'Input', 'Tokens_Input', 'Sentiment_Input', 'ID_Response', 'Response', 'Tokens_Response', 'Sentiment_Response']

In [43]:
# Selecting the needed columns including IDs
# (.copy() makes training_data an independent DataFrame rather than a slice of pairs)
training_data = pairs[['id_input', 'text_input', 'tokens_input', 'id_response', 'text_response', 'tokens_response']].copy()

# Renaming columns for clarity
training_data.columns = ['ID_Input', 'Input', 'Tokens_Input', 'ID_Response', 'Response', 'Tokens_Response']

In [44]:
training_data

Unnamed: 0,ID_Input,Input,Tokens_Input,ID_Response,Response,Tokens_Response
0,L1044,,"[<start>, <end>]",L1045,,"[<start>, <end>]"
1,L984,okay,"[<start>, okay, <end>]",L985,hope,"[<start>, hope, <end>]"
2,L924,wow,"[<start>, wow, <end>]",L925,lets go,"[<start>, let, go, <end>]"
3,L871,,"[<start>, <end>]",L872,okay youre gonna need learn lie,"[<start>, okay, youre, gon, na, need, learn, l..."
4,L870,im kidding know sometimes become persona dont ...,"[<start>, im, kidding, know, sometimes, become...",L871,,"[<start>, <end>]"
...,...,...,...,...,...,...
221611,L666520,well assure sir desire create difficulties <num>,"[<start>, well, assure, sir, desire, create, d...",L666521,assure fact id obliged best advice scouts seen,"[<start>, assure, fact, id, obliged, best, adv..."
221612,L666371,lord chelmsford seems want stay back basutos,"[<start>, lord, chelmsford, seems, want, stay,...",L666372,think chelmsford wants good man border fears f...,"[<start>, think, chelmsford, want, good, man, ..."
221613,L666370,im take sikali main column river,"[<start>, im, take, sikali, main, column, rive...",L666371,lord chelmsford seems want stay back basutos,"[<start>, lord, chelmsford, seems, want, stay,..."
221614,L666369,orders mr vereker,"[<start>, order, mr, vereker, <end>]",L666370,im take sikali main column river,"[<start>, im, take, sikali, main, column, rive..."


In [45]:
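# Sanity check on the expected number of pairs.
# Assuming each conversation is a single reply chain, every message except the
# conversation starter is a response, so pairs = messages - conversations.
len(conversations) - len(conversations.loc[:, 'conversation_id'].unique())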

221616
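
The count above matches the number of rows in `training_data`. The reasoning can be sketched on a small hypothetical frame: if every conversation is a single reply chain, the number of messages minus the number of conversations should equal the number of messages whose `reply_to` points at an existing `id`. The `toy` data and the variable names below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: two conversations with 3 and 2 messages.
toy = pd.DataFrame({
    "id":              ["L1", "L2", "L3", "L4", "L5"],
    "conversation_id": ["L1", "L1", "L1", "L4", "L4"],
    "reply_to":        [None, "L1", "L2", None, "L4"],
})

# Expected pairs: every message except each conversation's first one.
expected_pairs = len(toy) - toy["conversation_id"].nunique()

# Actual pairs: messages whose reply_to refers to a message present in the frame.
actual_pairs = int(toy["reply_to"].isin(toy["id"]).sum())

print(expected_pairs, actual_pairs)
```

If the two numbers differed on real data, that would suggest branching conversations or replies pointing at messages that were removed during cleaning.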

In [46]:
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221616 entries, 0 to 221615
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   ID_Input         221616 non-null  object
 1   Input            221616 non-null  object
 2   Tokens_Input     221616 non-null  object
 3   ID_Response      221616 non-null  object
 4   Response         221616 non-null  object
 5   Tokens_Response  221616 non-null  object
dtypes: object(6)
memory usage: 10.1+ MB


### Checking token length distribution

In [47]:
token_lengths_input = training_data['Tokens_Input'].apply(len)
token_lengths_response = training_data['Tokens_Response'].apply(len)

print("Input Token Lengths - Statistics:")
print(token_lengths_input.describe())

print("\nResponse Token Lengths - Statistics:")
print(token_lengths_response.describe())


Input Token Lengths - Statistics:
count    221616.000000
mean          7.212381
std           5.866951
min           2.000000
25%           4.000000
50%           5.000000
75%           9.000000
max         155.000000
Name: Tokens_Input, dtype: float64

Response Token Lengths - Statistics:
count    221616.000000
mean          7.408829
std           6.223942
min           2.000000
25%           4.000000
50%           6.000000
75%           9.000000
max         274.000000
Name: Tokens_Response, dtype: float64
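
A minimum length of 2 means that some sequences contain only `<start>` and `<end>`, i.e. the message was empty after cleaning, while the maximum values show a long tail. The following is a minimal sketch, on hypothetical toy data, of how such pairs could be flagged and how a length cap could be derived from a quantile. The 0.95 quantile and the names `non_empty`, `filtered` and `max_len` are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical toy data shaped like training_data's token columns.
toy = pd.DataFrame({
    "Tokens_Input": [
        ["<start>", "<end>"],
        ["<start>", "okay", "<end>"],
        ["<start>", "wow", "<end>"],
        ["<start>", "im", "kidding", "<end>"],
    ],
    "Tokens_Response": [
        ["<start>", "<end>"],
        ["<start>", "hope", "<end>"],
        ["<start>", "let", "go", "<end>"],
        ["<start>", "<end>"],
    ],
})

len_in = toy["Tokens_Input"].apply(len)
len_out = toy["Tokens_Response"].apply(len)

# A length of 2 means only <start> and <end> are present (empty message).
non_empty = (len_in > 2) & (len_out > 2)
filtered = toy[non_empty]

# Assumed heuristic: cap sequence length at the 95th percentile of all lengths.
max_len = int(pd.concat([len_in, len_out]).quantile(0.95))

print(len(filtered), max_len)
```

Whether empty pairs should be dropped and which cap is appropriate depends on the model that is trained later; the sketch only shows the mechanics.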


## Saving the DataFrame

In [48]:
file_path_parquet = os.path.join(data_dir, 'training_data.parquet')
training_data.to_parquet(file_path_parquet)

In [49]:
training_data

Unnamed: 0,ID_Input,Input,Tokens_Input,ID_Response,Response,Tokens_Response
0,L1044,,"[<start>, <end>]",L1045,,"[<start>, <end>]"
1,L984,okay,"[<start>, okay, <end>]",L985,hope,"[<start>, hope, <end>]"
2,L924,wow,"[<start>, wow, <end>]",L925,lets go,"[<start>, let, go, <end>]"
3,L871,,"[<start>, <end>]",L872,okay youre gonna need learn lie,"[<start>, okay, youre, gon, na, need, learn, l..."
4,L870,im kidding know sometimes become persona dont ...,"[<start>, im, kidding, know, sometimes, become...",L871,,"[<start>, <end>]"
...,...,...,...,...,...,...
221611,L666520,well assure sir desire create difficulties <num>,"[<start>, well, assure, sir, desire, create, d...",L666521,assure fact id obliged best advice scouts seen,"[<start>, assure, fact, id, obliged, best, adv..."
221612,L666371,lord chelmsford seems want stay back basutos,"[<start>, lord, chelmsford, seems, want, stay,...",L666372,think chelmsford wants good man border fears f...,"[<start>, think, chelmsford, want, good, man, ..."
221613,L666370,im take sikali main column river,"[<start>, im, take, sikali, main, column, rive...",L666371,lord chelmsford seems want stay back basutos,"[<start>, lord, chelmsford, seems, want, stay,..."
221614,L666369,orders mr vereker,"[<start>, order, mr, vereker, <end>]",L666370,im take sikali main column river,"[<start>, im, take, sikali, main, column, rive..."
