## Text mining workbook

In this workbook, we will be using data gathered from a community event in Austin, Texas hosted by the police to discuss racial profiling.

All of the required packages and modules can be imported by running the following cell:

In [38]:
# !pip install -U nltk
# !pip install -U textblob


In [39]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet 
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /Users/tommadden/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tommadden/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
df = pd.read_csv('data/policing.csv')
df.head()

Unnamed: 0,Response,Question,Group,Topic,Theme
0,I have been an Austin resident for 1.5 years w...,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
1,Long term Austin resident and UT student,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
2,works for councilmember Delia Garza,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
3,to learn and gain new perspective,What was your motivation for attending this ev...,3,Motivations and Feelings,People expressed interest in this community di...
4,"Social work experience, and to take info/exper...",What was your motivation for attending this ev...,5,Motivations and Feelings,People expressed interest in this community di...


### Part 1: Word Frequency

Run the following code cell to assign the text to the variable `motivation_question`

In [41]:
motivation_question = 'What was your motivation for attending this event?'

**Q1)** Assign to the variable `motivation`, a DataFrame containing entries in `df` where `Question` equals the `motivation_question` given above:

In [42]:
# Add your code below
# motivation = ...
df.query(f'Question == "{motivation_question}"').head()


Unnamed: 0,Response,Question,Group,Topic,Theme
0,I have been an Austin resident for 1.5 years w...,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
1,Long term Austin resident and UT student,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
2,works for councilmember Delia Garza,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
3,to learn and gain new perspective,What was your motivation for attending this ev...,3,Motivations and Feelings,People expressed interest in this community di...
4,"Social work experience, and to take info/exper...",What was your motivation for attending this ev...,5,Motivations and Feelings,People expressed interest in this community di...


**Q2)** Create a `list`, assigned to the variable `motivation_words`, which contains all of the words found in the `Response` column of the `motivation` DataFrame: 

- create an empty list `motivation_words`
- using a `for` loop, iterate through each entry in `motivation['Response']` and use `word_tokenize()` to get the separate words (`tokens`)
- add all `tokens` into `motivation_words` list using `.extend()` method
- don't for now remove any duplicates or stop words from the list, or change the case

In [56]:
# motivation_words = ...
response_lst =  (df.query(f'Question == "{motivation_question}"')
                 ['Response'].values
                )
motivation_words = []

for i in range(len(response_lst)):
    tokens =  word_tokenize(response_lst[i])
    motivation_words.extend(tokens)
    
motivation_words[:10]


['I',
 'have',
 'been',
 'an',
 'Austin',
 'resident',
 'for',
 '1.5',
 'years',
 'with']

Assign to the variable `top_5` a `list` of the five most common words in `motivation_words`:
- use `nltk.freqdist` and the `.most_common()` method
- extract the words from the resulting tuples

In [57]:
freqdist = nltk.FreqDist(motivation_words).most_common(5)
freqdist

[('and', 21), ('to', 16), (',', 11), ('the', 10), ('for', 9)]

In [58]:
top_5 = [entry[0] for entry in freqdist]
top_5

['and', 'to', ',', 'the', 'for']

**Q3)** Create a new list of words, assigned to `motivation_clean`, based on `motivation_words` but with:

- all English stopwords removed
- all words in lower case
- words containing only alphabetical characters


*For the alhpabetical characters requirement, consider using the Python [.isalpha()](https://www.w3schools.com/python/ref_string_isalpha.asp) method.*

In [59]:
# Add your code below

sw = stopwords.words('english')
motivation_clean = [w.lower() for w in motivation_words 
                    if (w not in sw) 
                    and (w.isalpha())
                   ]



Let’s explore the creation of the DataFrame `top_50_df`, with two columns:

- `Word`, containing the 50 most common words in `motivation_clean`
- `Count`, containing the number of occurrences of the given word in `motivation_clean`

*These values can be found in the list of tuples created when using `nltk.FreqDist()` and the `.most_common()` method.*

In [60]:
pd.Series(motivation_clean).value_counts()

police         8
austin         8
community      7
experience     6
resident       5
              ..
opportunity    1
commander      1
input          1
curious        1
healthcare     1
Length: 171, dtype: int64

In [61]:
# repaeat analysis with lemmas
wnl = WordNetLemmatizer()

# summary_lemmas = [wnl.lemmatize(word) for word in motivation_clean]
# pd.Series(summary_lemmas).value_counts()

freqdist_clean = nltk.FreqDist(motivation_clean).most_common(50)
top_50_df = pd.DataFrame(freqdist_clean, columns=['Word', 'Count'])
top_50_df[:10]

Unnamed: 0,Word,Count
0,austin,8
1,police,8
2,community,7
3,experience,6
4,resident,5
5,years,4
6,perspective,4
7,solutions,4
8,people,4
9,issues,4


In [64]:
# repeat analysis with synonyms
from nltk.corpus import wordnet 

synonyms = []   
all_synonyms = []

for syn in wordnet.synsets(motivation_clean[5]): 
    for lemma in syn.lemmas(): 
        synonyms.append(lemma.name()) 
print(set(synonyms))


{'household', 'phratry', 'syndicate', 'family_unit', 'category', 'kin', 'crime_syndicate', 'menage', 'kinsfolk', 'house', 'kinsperson', 'folk', 'sept', 'family_line', 'kinfolk', 'family', 'class', 'home', 'fellowship', 'mob'}


Consider the resulting DataFrame, and whether `lemmatisation` or `stemming` might be appropriate (or some other form of grouping the words).

There are no definite answers - we might think that 'officer' and 'officers' could be grouped, but perhaps these could also be grouped with 'police' and even 'apd' (Austin Police Department).

We might consider using synonyms or suchlike but it may also be more appropriate to group manually, or not at all.

### Part 2: Sentiment Analysis

Adding a column to `df` called `Response_Sentiment`, which contains a value given by the `.sentiment.polarity` attribute of a `TextBlob` object created from the text of each entry in the `Response` column:

    The polarity score is a float within the range [-1.0, 1.0]
    -1.0 defines a negative sentiment and 1.0 defines a positive sentiment

In [73]:
df['Response_Sentiment'] = df['Response'].apply(lambda x: TextBlob(x).sentiment.polarity)
df[['Response', 'Response_Sentiment']].sort_values('Response_Sentiment', ascending=False)

Unnamed: 0,Response,Response_Sentiment
169,Interventions for officers: what does that ent...,0.75
45,A lot of good recommendations in report,0.70
233,Often there is no follow-up after trainings. E...,0.60
225,Wealthy police department,0.50
290,I'd feel safe if they take away guns,0.50
...,...,...
179,Develop consequences for bad practices,-0.70
246,Frustrated by deflection,-0.70
250,Frustrated by unwillingness to give up control,-0.70
228,Frustrated by emphasis on training,-0.70


We can then use the `.describe()` method on the Series of values:

In [74]:
df['Response_Sentiment'].describe()

count    329.000000
mean       0.048662
std        0.209676
min       -0.700000
25%        0.000000
50%        0.000000
75%        0.100000
max        0.750000
Name: Response_Sentiment, dtype: float64

Run the following code cell to assign the text to the variable `feedback_question`

In [75]:
feedback_question = "How do you feel about what you have learned about the Racial Profiling Report so far? \
What came up for you reading the data or listening to the panel?"

**Q4)** Assign to the variable `feedback`, a DataFrame containing entries in `df` where `Question` equals the `feedback_question` given above:

In [83]:
# Add your code below
feedback = (df.query(f'Question == "{feedback_question}"')) 

feedback



Unnamed: 0,Response,Question,Group,Topic,Theme,Response_Sentiment
36,Data source: glad to have detailed sources cited,How do you feel about what you have learned ab...,5,Data,People highlighted the strengths of the racial...,0.450000
37,Translate reports in Spanish,How do you feel about what you have learned ab...,3,Data,People highlighted the strengths of the racial...,0.000000
38,Surprised by the depth and data,How do you feel about what you have learned ab...,5,Data,People highlighted the strengths of the racial...,0.100000
39,This report is a step above what's been done i...,How do you feel about what you have learned ab...,1,Data,People highlighted the strengths of the racial...,0.135625
40,Surprised by SD23 zero-disparity goal,How do you feel about what you have learned ab...,5,Data,People highlighted the strengths of the racial...,0.100000
...,...,...,...,...,...,...
313,Q: How have the policies changed based on trai...,How do you feel about what you have learned ab...,1,,People put forth questions about the report an...,0.000000
314,"Q: When arrests of students of color made, whe...",How do you feel about what you have learned ab...,3,,People put forth questions about the report an...,0.000000
315,Q: How are hot spots defined? Is it by calls? ...,How do you feel about what you have learned ab...,1,,People put forth questions about the report an...,0.232143
316,Q: Was there community input on the person hir...,How do you feel about what you have learned ab...,1,,People put forth questions about the report an...,0.000000


We can see that there are three different values in the `Topic` column, with a fairly even distribution between them:

In [84]:
feedback['Topic'].value_counts()

Racism/Systems    28
Accountability    21
Data              17
Name: Topic, dtype: int64

Creation of a DataFrame called `topic_sentiment`, using `.groupby()` on `feedback`, which shows the `mean`, `min`, and `max` `Response_Sentiment` values for each `Topic`: 

*You may find the pandas `.agg(['mean', 'min', 'max'])` method useful.*

In [85]:
topic_sentiment = feedback.groupby('Topic')['Response_Sentiment'].agg(['mean', 'min', 'max'])
topic_sentiment

Unnamed: 0_level_0,mean,min,max
Topic,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Accountability,0.004603,-0.7,0.5
Data,0.082406,-0.222222,0.45
Racism/Systems,-0.05744,-0.7,0.4


We would like to see if there was much variation in sentiment between the five different groups which participated.

Creation of a DataFrame called `group_sentiment` (adapting the same code as above) to find the same `Response_Sentiment` metrics, this time for all `Responses` in `df` and grouped by `Group`:

In [96]:
group_sentiment = feedback.groupby(['Group','Topic'])['Response_Sentiment'].agg(['mean','min','max'])
group_sentiment

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,min,max
Group,Topic,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Accountability,-0.057778,-0.2,0.16
1,Data,-0.009844,-0.125,0.135625
1,Racism/Systems,0.166667,0.166667,0.166667
3,Accountability,0.067,-0.25,0.5
3,Data,0.05,0.0,0.2
3,Racism/Systems,-0.028704,-0.25,0.125
4,Accountability,-0.1,-0.7,0.2
4,Data,-0.111111,-0.222222,0.0
4,Racism/Systems,-0.084259,-0.7,0.4
5,Accountability,0.033333,0.0,0.1
