# CAPSTONE PROJECT - CHATBOT

## DATA
### Data sources
https://convokit.cornell.edu/documentation/movie.html <br>
https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html 

### Install ConvoKit

In [1]:
#!pip install convokit

### Load data from source and save to 'data' folder

In [2]:
# from convokit import Corpus, download
# import os

# # Directory where to save the corpus
# data_dir = os.path.join(os.getcwd(), 'data')

# # Ensure the directory exists
# if not os.path.exists(data_dir):
#     os.makedirs(data_dir)

# # Downloading and saving the corpus
# corpus = Corpus(filename=download("movie-corpus", data_dir=data_dir))

# # Saving the corpus to the 'data' folder
# corpus_path = os.path.join(data_dir, "movie_corpus")
# corpus.dump(corpus_path)

Downloading movie-corpus to C:\Users\tomui\Desktop\capstone_project\data\movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done



### Load data from 'data' folder

In [3]:
from convokit import Corpus
import os

# Directory where to load the corpus
data_dir = os.path.join(os.getcwd(), 'data')

# Load the corpus from the specified folder
loaded_corpus = Corpus(filename=os.path.join(data_dir, "movie_corpus"))
loaded_corpus.print_summary_stats()

Number of Speakers: 9035
Number of Utterances: 304713
Number of Conversations: 83097


In [4]:
type(loaded_corpus)

convokit.model.corpus.Corpus

### Data Structure and Organization
```plaintext
data/
└── movie_corpus/
    ├── conversations.json
    ├── corpus.json
    ├── index.json
    ├── speakers.json
    └── utterances.jsonl
```

Each file is described in the [corpus documentation](https://convokit.cornell.edu/documentation/movie.html).


### Choice for exploration
Which files I need from the ConvoKit corpus depends on the functionality I want the chatbot to have. The key file is `utterances.jsonl`, because it contains the dialogue text. This is what I'll use to train the chatbot to understand and generate human-like responses.

Description from source:  
> "Utterance-level information <br>
> For each utterance, we provide:
> - id: index of the utterance
> - speaker: the speaker who authored the utterance
> - conversation_id: id of the first utterance in the conversation this utterance belongs to
> - reply_to: id of the utterance to which this utterance replies to (None if the utterance is not a reply)
> - timestamp: time of the utterance
> - text: textual content of the utterance
> 
> Metadata for utterances include:
> - movie_idx: index of the movie from which this utterance occurs
> - parsed: parsed version of the utterance text, represented as a SpaCy Doc"
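
The field list above can be tied to the file on disk with a small sketch. It parses a single JSON line shaped like the records printed later in this notebook. The sample line is an illustrative assumption (its values are taken from the first row of the data, with `parsed` shortened to an empty list). Note that the file stores the reply link under the key `reply-to` and the movie under `movie_id`, which differ slightly from the names in the quoted description.

```python
import json

# Illustrative record; values mirror the first utterance shown in this notebook,
# with the 'parsed' list shortened to keep the example readable.
sample_line = (
    '{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", '
    '"speaker": "u0", "meta": {"movie_id": "m0", "parsed": []}, '
    '"reply-to": "L1044", "timestamp": null}'
)

record = json.loads(sample_line)

# Pick out the fields that matter for dialogue modelling.
fields = {
    "id": record["id"],
    "conversation_id": record["conversation_id"],
    "reply_to": record.get("reply-to"),   # hyphenated key on disk
    "text": record["text"],
    "movie_id": record["meta"]["movie_id"],
}
print(fields)
```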


### Understanding data from `utterances.jsonl`

In [5]:
import json
from pprint import pprint as pp

# Initialize a list to hold all the utterances
utterances = []

# Open the file and read line by line
with open(os.path.join(data_dir, 'movie_corpus', 'utterances.jsonl'), 'r') as file:
    
    for line in file:
        utterance = json.loads(line)
        utterances.append(utterance)

In [6]:
print(type(utterances))

<class 'list'>


In [7]:
print(f'There are a total of {len(utterances)} lines\n')
pp(utterances[: 3])

There are a total of 304713 lines

[{'conversation_id': 'L1044',
  'id': 'L1045',
  'meta': {'movie_id': 'm0',
           'parsed': [{'rt': 1,
                       'toks': [{'dep': 'nsubj',
                                 'dn': [],
                                 'tag': 'PRP',
                                 'tok': 'They',
                                 'up': 1},
                                {'dep': 'ROOT',
                                 'dn': [0, 2, 3],
                                 'tag': 'VBP',
                                 'tok': 'do'},
                                {'dep': 'neg',
                                 'dn': [],
                                 'tag': 'RB',
                                 'tok': 'not',
                                 'up': 1},
                                {'dep': 'punct',
                                 'dn': [],
                                 'tag': '.',
                                 'tok': '!',
                           

### Understanding data from other dataset json files

In [8]:
# Load the data
# with open(os.path.join(data_dir, 'movie_corpus', 'conversations.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'corpus.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'index.json'), 'r') as file:
#     conversations = json.load(file)
# with open(os.path.join(data_dir, 'movie_corpus', 'speakers.json'), 'r') as file:
#     conversations = json.load(file)

# print(type(conversations))

# print(f'There are a total of {len(conversations)} keys in the dictionary\n')
# first_three_items = list(conversations.items())[:3]
# pp(first_three_items)

## Decision regarding data

For the chatbot I decided to use only the data in `utterances.jsonl`, which is enough to model multi-turn conversations. The essential fields are `'text'` for generating responses, `'conversation_id'` for tracking which conversation an utterance belongs to, and `'reply_to'` for recovering the order of turns within a dialogue.

The first version of the chatbot will not use the pre-parsed linguistic data, but I will still collect `'parsed'` and `'toks'` so that the option stays open. Whether to use this pre-parsed data directly, regenerate it, or compare the two will be decided later as the project evolves.
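
How `conversation_id` and `reply_to` together recover the turn order can be sketched in plain Python. The three rows are copied from the data shown in this notebook. The helper `order_conversation` is a hypothetical name, and it assumes a linear thread in which every utterance has at most one reply.

```python
# Rows copied from the corpus preview in this notebook (conversation L870).
utterance_rows = [
    {"id": "L872", "conversation_id": "L870", "reply_to": "L871",
     "text": "Okay -- you're gonna need to learn how to lie."},
    {"id": "L871", "conversation_id": "L870", "reply_to": "L870",
     "text": "No"},
    {"id": "L870", "conversation_id": "L870", "reply_to": None,
     "text": "I'm kidding. You know how sometimes you just ..."},
]

def order_conversation(rows, conversation_id):
    """Return one conversation's utterances in reply order, root first.

    Assumption: the thread is linear (at most one reply per utterance).
    """
    in_conversation = [r for r in rows if r["conversation_id"] == conversation_id]
    # Map each parent id to the utterance that replies to it.
    reply_of = {r["reply_to"]: r for r in in_conversation}
    ordered = []
    current = reply_of.get(None)  # the root is the row without a parent
    while current is not None:
        ordered.append(current)
        current = reply_of.get(current["id"])
    return ordered

ordered_turns = order_conversation(utterance_rows, "L870")
for turn in ordered_turns:
    print(turn["id"], turn["text"])
```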

## Converting utterances data to DataFrame
Pandas provides a powerful and easy-to-use interface for data manipulation, filtering, transformation, and analysis. It integrates with the rest of the Python ecosystem, including machine-learning libraries (e.g., scikit-learn, TensorFlow) and visualization libraries (e.g., Matplotlib, Seaborn), and it is fast for datasets that fit comfortably in memory.

In [9]:
import pandas as pd

# Flatten the data
def flatten_data(data):
    flattened_data = []
    for entry in data:
        flat_entry = {
            'id': entry['id'],
            'conversation_id': entry['conversation_id'],
            'text': entry['text'],
            'speaker': entry['speaker'],
            'reply_to': entry.get('reply-to'),
            'timestamp': entry['timestamp'],
            'movie_id': entry['meta']['movie_id'],
        }
        # Handle nested parsed data
        for parsed in entry['meta']['parsed']:
            for idx, tok in enumerate(parsed['toks']):
                flat_entry[f'tok_{idx}_token'] = tok['tok']
                flat_entry[f'tok_{idx}_tag'] = tok['tag']
                flat_entry[f'tok_{idx}_dep'] = tok['dep']
                # Add other fields from tokens as needed
        flattened_data.append(flat_entry)
    return flattened_data

# Convert to DataFrame
flattened_data = flatten_data(utterances)
df = pd.DataFrame(flattened_data)
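
One caveat about `flatten_data`: `idx` restarts at zero for every parsed sentence, so in a multi-sentence utterance the tokens of later sentences overwrite the `tok_{idx}_*` entries of earlier ones. This may be why, in a later preview, the utterance starting with `I'm kidding.` has `tok_0_token` equal to `And`. The sketch below shows one possible alternative that keeps a running position across sentences. The helper name `flatten_tokens` and the two-sentence sample are illustrative assumptions, and this variant would produce more columns than the version above.

```python
def flatten_tokens(parsed_sentences):
    """Flatten all sentences into tok_{n}_* keys using one running index."""
    flat = {}
    position = 0  # keeps counting across sentence boundaries
    for sentence in parsed_sentences:
        for tok in sentence["toks"]:
            flat[f"tok_{position}_token"] = tok["tok"]
            flat[f"tok_{position}_tag"] = tok["tag"]
            flat[f"tok_{position}_dep"] = tok["dep"]
            position += 1
    return flat

# Hypothetical two-sentence parse in the same layout as meta['parsed'].
parsed_example = [
    {"rt": 0, "toks": [
        {"tok": "Wow", "tag": "UH", "dep": "ROOT", "dn": [1]},
        {"tok": ".", "tag": ".", "dep": "punct", "up": 0, "dn": []},
    ]},
    {"rt": 0, "toks": [
        {"tok": "No", "tag": "UH", "dep": "ROOT", "dn": []},
    ]},
]

flat_example = flatten_tokens(parsed_example)
print(flat_example["tok_0_token"], flat_example["tok_2_token"])
```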

In [10]:
# Show DataFrame to check structure
df.head()

Unnamed: 0,id,conversation_id,text,speaker,reply_to,timestamp,movie_id,tok_0_token,tok_0_tag,tok_0_dep,...,tok_121_dep,tok_122_token,tok_122_tag,tok_122_dep,tok_123_token,tok_123_tag,tok_123_dep,tok_124_token,tok_124_tag,tok_124_dep
0,L1045,L1044,They do not!,u0,L1044,,m0,They,PRP,nsubj,...,,,,,,,,,,
1,L1044,L1044,They do to!,u2,,,m0,They,PRP,nsubj,...,,,,,,,,,,
2,L985,L984,I hope so.,u0,L984,,m0,I,PRP,nsubj,...,,,,,,,,,,
3,L984,L984,She okay?,u2,,,m0,She,PRP,nsubj,...,,,,,,,,,,
4,L925,L924,Let's go.,u0,L924,,m0,Let,VB,ROOT,...,,,,,,,,,,


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Columns: 382 entries, id to tok_124_dep
dtypes: object(382)
memory usage: 888.1+ MB


In [12]:
# Temporarily adjust display settings to show all columns
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.isnull().sum())
#print(df.isnull().sum())

id                      0
conversation_id         0
text                    0
speaker                 0
reply_to            83097
timestamp          304713
movie_id                0
tok_0_token           267
tok_0_tag             267
tok_0_dep             267
tok_1_token           625
tok_1_tag             625
tok_1_dep             625
tok_2_token         21729
tok_2_tag           21729
tok_2_dep           21729
tok_3_token         38515
tok_3_tag           38515
tok_3_dep           38515
tok_4_token         63316
tok_4_tag           63316
tok_4_dep           63316
tok_5_token         92404
tok_5_tag           92404
tok_5_dep           92404
tok_6_token        121962
tok_6_tag          121962
tok_6_dep          121962
tok_7_token        149962
tok_7_tag          149962
tok_7_dep          149962
tok_8_token        175281
tok_8_tag          175281
tok_8_dep          175281
tok_9_token        196627
tok_9_tag          196627
tok_9_dep          196627
tok_10_token       214433
tok_10_tag  

## Saving the DataFrame

In [13]:
# Saving the DataFrame
file_path_parquet = os.path.join(data_dir, 'utterances.parquet')
df.to_parquet(file_path_parquet)

File sizes on disk: `utterances.jsonl` is 351,404 KB and `utterances.parquet` is 28,409 KB, roughly 12 times smaller.

## Loading the DataFrame

In [15]:
# Loading the DataFrame
file_path_parquet = os.path.join(data_dir, 'utterances.parquet')
df_loaded_parquet = pd.read_parquet(file_path_parquet)

df_loaded_parquet.head(20)

Unnamed: 0,id,conversation_id,text,speaker,reply_to,timestamp,movie_id,tok_0_token,tok_0_tag,tok_0_dep,...,tok_121_dep,tok_122_token,tok_122_tag,tok_122_dep,tok_123_token,tok_123_tag,tok_123_dep,tok_124_token,tok_124_tag,tok_124_dep
0,L1045,L1044,They do not!,u0,L1044,,m0,They,PRP,nsubj,...,,,,,,,,,,
1,L1044,L1044,They do to!,u2,,,m0,They,PRP,nsubj,...,,,,,,,,,,
2,L985,L984,I hope so.,u0,L984,,m0,I,PRP,nsubj,...,,,,,,,,,,
3,L984,L984,She okay?,u2,,,m0,She,PRP,nsubj,...,,,,,,,,,,
4,L925,L924,Let's go.,u0,L924,,m0,Let,VB,ROOT,...,,,,,,,,,,
5,L924,L924,Wow,u2,,,m0,Wow,UH,ROOT,...,,,,,,,,,,
6,L872,L870,Okay -- you're gonna need to learn how to lie.,u0,L871,,m0,Okay,UH,intj,...,,,,,,,,,,
7,L871,L870,No,u2,L870,,m0,No,UH,ROOT,...,,,,,,,,,,
8,L870,L870,I'm kidding. You know how sometimes you just ...,u0,,,m0,And,CC,cc,...,,,,,,,,,,
9,L869,L866,Like my fear of wearing pastels?,u0,L868,,m0,Like,IN,ROOT,...,,,,,,,,,,


## Data cleaning

### Keeping only the data needed for the initial stage of the project
I keep `id` and `conversation_id` for tracking the flow of conversations, `reply_to` for understanding the sequence within a dialogue, and of course the conversation `text`.

In [19]:
conversations = df_loaded_parquet[['text', 'id', 'conversation_id', 'reply_to']]
conversations.head(30)

Unnamed: 0,text,id,conversation_id,reply_to
0,They do not!,L1045,L1044,L1044
1,They do to!,L1044,L1044,
2,I hope so.,L985,L984,L984
3,She okay?,L984,L984,
4,Let's go.,L925,L924,L924
5,Wow,L924,L924,
6,Okay -- you're gonna need to learn how to lie.,L872,L870,L871
7,No,L871,L870,L870
8,I'm kidding. You know how sometimes you just ...,L870,L870,
9,Like my fear of wearing pastels?,L869,L866,L868


In [20]:
# Check whether the number of missing 'reply_to' values equals the number of rows where 'id' == 'conversation_id'
print("Total entries where 'id' equals 'conversation_id':", (df['id'] == df['conversation_id']).sum())
print("Total entries where 'reply_to' is None:", df['reply_to'].isnull().sum())

Total entries where 'id' equals 'conversation_id': 83097
Total entries where 'reply_to' is None: 83097


## Conclusion
The dataset looks well structured and ready for further processing: <br>
`conversation_id` and `id`:  <br>
When `conversation_id` equals `id` and there is no `reply_to`, the utterance starts a new conversation. This shows where each conversation begins.  <br>
Counts of None in `reply_to`:  <br>
The number of missing `reply_to` values matches the number of conversations (83,097). This is consistent with each conversation starting with exactly one message that does not reply to any previous message, i.e. the first message in the thread.  <br>
Data cleanliness:  <br>
The agreement of these counts and the consistent formatting suggest that the dataset is clean: each message is linked to its conversation, and the flow of each conversation is well defined.  <br>
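
Equal totals do not by themselves show that the two conditions hold on the same rows. A minimal sketch of a row-wise check follows, using a tiny frame built from the first four rows of the data. The variable names are illustrative, and the same expressions could be applied to the full `df`.

```python
import pandas as pd

# First four rows of the corpus as shown in this notebook.
sample = pd.DataFrame({
    "id": ["L1045", "L1044", "L985", "L984"],
    "conversation_id": ["L1044", "L1044", "L984", "L984"],
    "reply_to": ["L1044", None, "L984", None],
})

is_root_by_id = sample["id"] == sample["conversation_id"]
is_root_by_reply = sample["reply_to"].isnull()

# True only if both definitions of "conversation start" mark the same rows.
masks_agree = bool((is_root_by_id == is_root_by_reply).all())

# True only if no conversation has more than one starting row.
one_root_each = bool(sample.loc[is_root_by_id, "conversation_id"].is_unique)

print(masks_agree, one_root_each)
```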

In [21]:
conversations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   text             304713 non-null  object
 1   id               304713 non-null  object
 2   conversation_id  304713 non-null  object
 3   reply_to         221616 non-null  object
dtypes: object(4)
memory usage: 9.3+ MB


In [23]:
conversations.text.describe()

count     304713
unique    265774
top        What?
freq        1684
Name: text, dtype: object

## Analyze text of conversations

In [None]:
# !pip install textblob

In [30]:
from nltk import FreqDist, word_tokenize
from nltk.tokenize import sent_tokenize
from textblob import TextBlob
import nltk
import numpy as np  # needed for np.mean, np.min, np.max and np.std below

# Ensure that the punkt tokenizer is available
nltk.download('punkt')

# Basic statistics
conversations.loc[:, 'msg_length'] = conversations['text'].apply(len)
conversations.loc[:, 'word_count'] = conversations['text'].apply(lambda x: len(word_tokenize(x)))


print("Average message length (characters):", np.mean(conversations['msg_length']))
print("Average message length (words):", np.mean(conversations['word_count']))
print("Min message length (characters):", np.min(conversations['msg_length']))
print("Max message length (characters):", np.max(conversations['msg_length']))
print("Standard deviation (characters):", np.std(conversations['msg_length']))

# Word frequency analysis
all_words = ' '.join(conversations['text']).lower()
words = word_tokenize(all_words)
freq_dist = FreqDist(words)
print("Most common words:", freq_dist.most_common(50))

# Sentiment analysis
conversations.loc[:, 'sentiment'] = conversations['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
print("Average sentiment (polarity):", np.mean(conversations['sentiment']))
print("Sentiment distribution:", conversations['sentiment'].describe())


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Average message length (characters): 55.25953930419772
Average message length (words): 13.721094931952361
Min message length (characters): 0
Max message length (characters): 3046
Standard deviation (characters): 64.06661834805733
Most common words: [('.', 332912), (',', 170188), ('you', 148400), ('i', 140952), ('?', 110240), ('the', 99132), ('to', 80649), ('a', 70839), ("'s", 66538), ('it', 66076), ("n't", 55224), ('...', 50796), ('do', 47049), ('that', 46582), ('and', 45934), ('of', 39338), ('!', 37866), ('what', 37719), ('in', 34129), ('me', 32203), ('is', 31639), ('we', 29291), ('he', 27408), ('--', 26662), ('this', 24616), ('for', 23415), ('have', 22934), ("'m", 22578), ("'re", 21717), ('know', 21657), ('was', 21407), ('your', 20962), ('my', 20824), ('not', 19883), ('on', 19560), ('no', 19425), ('be', 19414), ('are', 17600), ('but', 17321), ('with', 17249), ('they', 16942), ('just', 15853), ('all', 15392), ('like', 15007), ("'ll", 14613), ('did', 14547), ('there', 14446), ('get', 1

In [54]:
# Count messages with zero characters
zero_length_messages = conversations[conversations['text'].apply(len) == 0]
print("Number of zero-length messages:", len(zero_length_messages))
zero_length_messages.sample(10)

Number of zero-length messages: 267


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment
153596,,L128654,L128650,L128653,0,0,0.0
153605,,L128644,L128625,L128643,0,0,0.0
153351,,L129302,L129301,L129301,0,0,0.0
168756,,L166989,L166988,L166988,0,0,0.0
153915,,L128979,L128978,L128978,0,0,0.0
153384,,L129179,L129168,L129178,0,0,0.0
154115,,L128754,L128752,L128753,0,0,0.0
153573,,L128677,L128650,L128676,0,0,0.0
271158,,L556463,L556443,L556462,0,0,0.0
299552,,L649938,L649936,L649937,0,0,0.0


Zero-length messages can fall into two groups. Those with the same `id` and `conversation_id` and None in `reply_to` would represent the start of a conversation, so removing them could remove the entry point of a conversational thread. Those with a different `id` and `conversation_id` and a specific `reply_to` are responses within a conversation, so removing them could leave gaps in the sequence and make the flow of the conversation harder to follow.

I've decided to keep the zero-length messages for now. Besides, there are only 267 of them.

In [36]:
# Calculating long messages
long_messages = conversations[conversations['msg_length'] > 300]
print("Number of long messages:", len(long_messages))


Number of long messages: 3151


In [53]:
# Print sample long messages
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

Oh, God...Oh, God...  Jim, excuse me...Ray, I told you, who he is is the senior vice- president American Express.  His family owns 32 per...Over the past years I've sold him...I can't tell you the dollar amount, but quite a lot of land.  I promised five weeks ago that I'd go to the wife's birthday party in Kenilworth tonight.  I have to go.  You understand. They treat me like a member of the family, so I have to go.


I decided to keep the long messages for now, because I am considering transformer-based models such as BERT or GPT, which are adept at understanding context over longer stretches of text.

## Conversations text preprocessing

### Normalize text

In [58]:
conversations.loc[:, 'text'] = conversations['text'].str.lower().str.strip()
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

it was warmer than i thought it would be, and the skin was softer than it looked. it's weird. thinking about it now, the organ itself seemed like a separate thing, a separate entity to me. i mean, after he pulled it out and i could look at it and touch it, i completely forgot that there was a guy attached to it. i remember literally being startled when the guy spoke to me.


### Remove punctuation

In [60]:
import string
# Build the translation table once instead of once per row
punct_table = str.maketrans('', '', string.punctuation)
conversations.loc[:, 'text'] = conversations['text'].apply(lambda x: x.translate(punct_table))
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

last spring i happened to walk past a house that i had once patronized there was a cool breeze blowing off the ocean and through the window i could see a bare leg the girl must have been taking a break between customers it was a strange moment for me because it reminded me of my mother and despite the fact that i was late for something already i just stayed there loving the atmosphere of it and my memory and the reason im telling you this epilogue is that i felt that id come full circle


### Remove Stopwords

In [61]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
conversations.loc[:, 'text'] = conversations['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
long_messages = conversations[conversations['msg_length'] > 300]
sampled_text = long_messages.sample(1)['text'].iloc[0]
print(sampled_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


uwouldau gone wouldnta thought twice trust thats kinda guy guy like momentary loss sanity wasnt thinking clearly listen im hero john want dough maybe little favor much didja spend already dogooder bullshit didnt spend uallu didja


## Re-analyze text of conversations

In [62]:
# Basic statistics
conversations.loc[:, 'msg_length'] = conversations['text'].apply(len)
conversations.loc[:, 'word_count'] = conversations['text'].apply(lambda x: len(word_tokenize(x)))


print("Average message length (characters):", np.mean(conversations['msg_length']))
print("Average message length (words):", np.mean(conversations['word_count']))
print("Min message length (characters):", np.min(conversations['msg_length']))
print("Max message length (characters):", np.max(conversations['msg_length']))
print("Standard deviation (characters):", np.std(conversations['msg_length']))

# Word frequency analysis
all_words = ' '.join(conversations['text']).lower()
words = word_tokenize(all_words)
freq_dist = FreqDist(words)
print("Most common words:", freq_dist.most_common(50))

# Sentiment analysis
conversations.loc[:, 'sentiment'] = conversations['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
print("Average sentiment (polarity):", np.mean(conversations['sentiment']))
print("Sentiment distribution:", conversations['sentiment'].describe())


Average message length (characters): 31.95437674139272
Average message length (words): 5.361769927768097
Min message length (characters): 0
Max message length (characters): 1836
Standard deviation (characters): 38.741096498848194
Most common words: [('dont', 24414), ('im', 22433), ('know', 21509), ('like', 14943), ('get', 14114), ('youre', 13615), ('got', 13268), ('well', 11813), ('want', 11030), ('thats', 10946), ('think', 10731), ('one', 10297), ('right', 9963), ('go', 9878), ('going', 8861), ('see', 8178), ('oh', 7721), ('yes', 7287), ('good', 7277), ('ill', 6922), ('yeah', 6871), ('tell', 6807), ('come', 6735), ('hes', 6691), ('cant', 6634), ('time', 6522), ('back', 6179), ('would', 6110), ('say', 6089), ('us', 5942), ('didnt', 5907), ('look', 5873), ('could', 5795), ('take', 5664), ('man', 5549), ('never', 5433), ('something', 5420), ('ive', 5293), ('na', 5279), ('mean', 5009), ('way', 4934), ('whats', 4875), ('make', 4736), ('really', 4558), ('okay', 4536), ('little', 4501), ('su
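
The frequency list above still contains words such as `dont`, `youre` and `didnt`. A likely reason is the order of the preprocessing steps: punctuation was removed before the stop-word filter ran, and NLTK's English list stores contractions with their apostrophe (for example `don't`), so the stripped forms no longer match. The sketch below compares the two orders. The miniature stop-word set and both function names are illustrative assumptions, not the NLTK list itself.

```python
import string

# Hypothetical miniature stop-word set, for illustration only; the notebook
# itself uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"i", "you", "do", "don't", "you're", "to", "what"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def punctuation_then_stopwords(text):
    """Order used in this notebook: strip punctuation, then filter."""
    stripped = text.lower().translate(PUNCT_TABLE)
    return [w for w in stripped.split() if w not in STOP_WORDS]

def stopwords_then_punctuation(text):
    """Alternative order: filter while apostrophes are still present."""
    kept = [
        w for w in text.lower().split()
        # Ignore punctuation at the edges of a word when matching.
        if w.strip(string.punctuation) not in STOP_WORDS
    ]
    cleaned = [w.translate(PUNCT_TABLE) for w in kept]
    return [w for w in cleaned if w]  # drop tokens that were only punctuation

sample_text = "You don't know what you're doing!"
print(punctuation_then_stopwords(sample_text))
print(stopwords_then_punctuation(sample_text))
```

Whether contractions should be dropped at all is a separate decision, since negations such as `don't` can change the meaning of a reply.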

### Tokenize text

In [63]:
conversations.loc[:, 'tokens'] = conversations['text'].apply(word_tokenize)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  conversations['tokens'] = conversations['text'].apply(word_tokenize)
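
The warning above is what pandas typically emits when a column is assigned on a frame that was itself obtained by selecting from another frame, as `conversations` was from `df_loaded_parquet`. A common remedy is to take an explicit `.copy()` at selection time. The sketch below illustrates this on a tiny frame. The frame and the use of `str.split` as a stand-in for `word_tokenize` are assumptions made to keep the example self-contained.

```python
import pandas as pd

# Tiny stand-in for df_loaded_parquet, using two rows from this notebook.
full = pd.DataFrame({
    "text": ["They do not!", "They do to!"],
    "id": ["L1045", "L1044"],
    "speaker": ["u0", "u2"],
})

# Explicit copy: the subset owns its data, so adding columns is unambiguous.
subset = full[["text", "id"]].copy()

# str.split stands in for nltk.word_tokenize in this sketch.
subset["tokens"] = subset["text"].str.split()

print(subset)
print("tokens" in full.columns)  # the original frame is left untouched
```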


In [75]:
conversations[1200: 1210]

Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
1200,czechoslovakia slavs fighting germans russians...,L2897,L2895,L2896,99,12,0.1,"[czechoslovakia, slavs, fighting, germans, rus..."
1201,eastern europe like romania hungary,L2896,L2895,L2895,35,5,0.0,"[eastern, europe, like, romania, hungary]"
1202,maybe ritual thing someone trying send message...,L2895,L2895,,150,21,-0.166667,"[maybe, ritual, thing, someone, trying, send, ..."
1203,look im even sure anything saw outside fire th...,L2893,L2892,L2892,106,17,0.25,"[look, im, even, sure, anything, saw, outside,..."
1204,would call,L2892,L2892,,10,2,0.0,"[would, call]"
1205,says shes suspect,L2891,L2887,L2890,17,3,0.0,"[says, shes, suspect]"
1206,maybe dont care either prettiest suspect ive a...,L2890,L2887,L2889,51,8,0.0,"[maybe, dont, care, either, prettiest, suspect..."
1207,hmmmm,L2889,L2887,L2888,5,1,0.0,[hmmmm]
1208,pretty,L2888,L2887,L2887,6,1,0.25,[pretty]
1209,super said hed seen didnt live,L2887,L2887,,30,6,0.234848,"[super, said, hed, seen, didnt, live]"


### Lemmatize words

In [76]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
conversations.loc[:, 'tokens'] = conversations['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
conversations[1200: 1210]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tomui\AppData\Roaming\nltk_data...


Unnamed: 0,text,id,conversation_id,reply_to,msg_length,word_count,sentiment,tokens
1200,czechoslovakia slavs fighting germans russians...,L2897,L2895,L2896,99,12,0.1,"[czechoslovakia, slav, fighting, german, russi..."
1201,eastern europe like romania hungary,L2896,L2895,L2895,35,5,0.0,"[eastern, europe, like, romania, hungary]"
1202,maybe ritual thing someone trying send message...,L2895,L2895,,150,21,-0.166667,"[maybe, ritual, thing, someone, trying, send, ..."
1203,look im even sure anything saw outside fire th...,L2893,L2892,L2892,106,17,0.25,"[look, im, even, sure, anything, saw, outside,..."
1204,would call,L2892,L2892,,10,2,0.0,"[would, call]"
1205,says shes suspect,L2891,L2887,L2890,17,3,0.0,"[say, shes, suspect]"
1206,maybe dont care either prettiest suspect ive a...,L2890,L2887,L2889,51,8,0.0,"[maybe, dont, care, either, prettiest, suspect..."
1207,hmmmm,L2889,L2887,L2888,5,1,0.0,[hmmmm]
1208,pretty,L2888,L2887,L2887,6,1,0.25,[pretty]
1209,super said hed seen didnt live,L2887,L2887,,30,6,0.234848,"[super, said, hed, seen, didnt, live]"


## Preparing for training
Format the data so that it is suitable for chatbot training.
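
One possible target format, assuming a model that learns from (prompt, response) pairs, links every reply to the utterance it answers through `reply_to`. The sketch below uses rows copied from the data shown earlier. The helper `build_pairs` is a hypothetical name, and skipping pairs with empty text is an illustrative choice rather than a decision made in this notebook.

```python
# Rows copied from the `conversations` preview in this notebook.
rows = [
    {"id": "L1045", "reply_to": "L1044", "text": "They do not!"},
    {"id": "L1044", "reply_to": None, "text": "They do to!"},
    {"id": "L985", "reply_to": "L984", "text": "I hope so."},
    {"id": "L984", "reply_to": None, "text": "She okay?"},
    {"id": "L925", "reply_to": "L924", "text": "Let's go."},
    {"id": "L924", "reply_to": None, "text": "Wow"},
]

def build_pairs(rows):
    """Return (prompt, response) tuples by following each row's reply_to."""
    text_by_id = {row["id"]: row["text"] for row in rows}
    pairs = []
    for row in rows:
        parent_id = row["reply_to"]
        if parent_id is None or parent_id not in text_by_id:
            continue  # conversation starts have no prompt
        prompt, response = text_by_id[parent_id], row["text"]
        if prompt and response:  # skip pairs where either side is empty
            pairs.append((prompt, response))
    return pairs

training_pairs = build_pairs(rows)
for prompt, response in training_pairs:
    print(repr(prompt), "->", repr(response))
```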