## Text mining workbook

In this workbook, we will use data gathered at a community event in Austin, Texas, hosted by the police to discuss racial profiling.

All of the required packages and modules can be imported by running the following cell:

In [3]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet 
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

In [4]:
df = pd.read_csv('data/policing.csv')
df.head()

Unnamed: 0,Response,Question,Group,Topic,Theme
0,I have been an Austin resident for 1.5 years w...,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
1,Long term Austin resident and UT student,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
2,works for councilmember Delia Garza,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
3,to learn and gain new perspective,What was your motivation for attending this ev...,3,Motivations and Feelings,People expressed interest in this community di...
4,"Social work experience, and to take info/exper...",What was your motivation for attending this ev...,5,Motivations and Feelings,People expressed interest in this community di...


### Part 1: Word Frequency

Run the following code cell to assign the question text to the variable `motivation_question`:

In [5]:
motivation_question = 'What was your motivation for attending this event?'

**Q1)** Assign to the variable `motivation` a DataFrame containing the entries in `df` where `Question` equals the `motivation_question` given above:

In [6]:
# Add your code below
# motivation = ...

motivation = df[df['Question'] == motivation_question]
motivation.head()


Unnamed: 0,Response,Question,Group,Topic,Theme
0,I have been an Austin resident for 1.5 years w...,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
1,Long term Austin resident and UT student,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
2,works for councilmember Delia Garza,What was your motivation for attending this ev...,1,Motivations and Feelings,People expressed interest in this community di...
3,to learn and gain new perspective,What was your motivation for attending this ev...,3,Motivations and Feelings,People expressed interest in this community di...
4,"Social work experience, and to take info/exper...",What was your motivation for attending this ev...,5,Motivations and Feelings,People expressed interest in this community di...


**Q2)** Create a `list`, assigned to the variable `motivation_words`, which contains all of the words found in the `Response` column of the `motivation` DataFrame: 

- create an empty list `motivation_words`
- using a `for` loop, iterate through each entry in `motivation['Response']` and use `word_tokenize()` to split it into separate words (`tokens`)
- add all `tokens` to the `motivation_words` list using the `.extend()` method
- for now, don't remove duplicates or stop words from the list, or change the case
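
The choice of `.extend()` over `.append()` in the steps above matters for the shape of the result. A minimal sketch, using two short hypothetical token lists in place of real `word_tokenize()` output:

```python
# Hypothetical token lists standing in for two tokenised responses
first_response = ['to', 'learn']
second_response = ['Self', 'education']

# .extend() adds each token individually, giving one flat list
extended = []
extended.extend(first_response)
extended.extend(second_response)

# .append() adds each list as a single element, giving a list of lists
appended = []
appended.append(first_response)
appended.append(second_response)

print(extended)
print(appended)
```

A flat list is what a word-frequency count needs, since each element is then a single word rather than a whole response.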

In [7]:
# Add your code below
# motivation_words = ...

motivation_words = []

for text in motivation['Response']:
    words = word_tokenize(text)
    motivation_words.extend(words)

motivation_words


['I',
 'have',
 'been',
 'an',
 'Austin',
 'resident',
 'for',
 '1.5',
 'years',
 'with',
 'family',
 'here',
 ';',
 'family',
 'is',
 'racially',
 'mixed',
 ',',
 'my',
 'son',
 'is',
 'stopped',
 'a',
 'lot',
 'in',
 'Philadelphia',
 'Long',
 'term',
 'Austin',
 'resident',
 'and',
 'UT',
 'student',
 'works',
 'for',
 'councilmember',
 'Delia',
 'Garza',
 'to',
 'learn',
 'and',
 'gain',
 'new',
 'perspective',
 'Social',
 'work',
 'experience',
 ',',
 'and',
 'to',
 'take',
 'info/experience',
 'back',
 'to',
 'their',
 'commision',
 'To',
 'better',
 'understand',
 'Help',
 'contribute',
 'to',
 'discussion',
 'and',
 'work',
 'on',
 'solutions',
 'Self',
 'education',
 'Share',
 'outcomes',
 'Anti-racism',
 'consulting',
 ',',
 'power',
 'dynamics',
 'in',
 'social',
 'systems',
 'Public',
 'policy',
 'researcher',
 'and',
 'grad',
 'student',
 'at',
 'UT',
 ',',
 'studies',
 'community',
 'and',
 'police',
 'relations',
 'and',
 'does',
 'stuff',
 'with',
 'Equity',
 'Office',
 

Assign to the variable `top_5` a `list` of the five most common words in `motivation_words`:
- use `nltk.FreqDist` and the `.most_common()` method
- extract the words from the resulting tuples

In [8]:
freqdist = nltk.FreqDist(motivation_words).most_common(5)
freqdist

[('and', 21), ('to', 16), (',', 11), ('the', 10), ('for', 9)]

In [10]:
top_5 = [entry[0] for entry in freqdist]
top_5

['and', 'to', ',', 'the', 'for']


**Q3)** Create a new list of words, assigned to `motivation_clean`, based on `motivation_words` but with:

- all English stopwords removed
- all words in lower case
- words containing only alphabetical characters


*For the alphabetical characters requirement, consider using the Python [.isalpha()](https://www.w3schools.com/python/ref_string_isalpha.asp) method.*
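
To get a feel for what `.isalpha()` keeps and discards, here is a small sketch using a handful of tokens copied from the `motivation_words` output above. The variable names `alpha_check` and `kept` are only for illustration:

```python
# Sample tokens taken from the word_tokenize() output shown earlier
tokens = ['Austin', '1.5', ',', 'info/experience', 'Anti-racism', 'resident']

# .isalpha() is True only when every character in the string is a letter
alpha_check = {token: token.isalpha() for token in tokens}

# Keep the purely alphabetical tokens, lower-cased
kept = [token.lower() for token in tokens if token.isalpha()]

print(alpha_check)
print(kept)
```

Note that tokens containing a slash or hyphen, such as 'info/experience' and 'Anti-racism', are dropped entirely rather than split, which may or may not be what the analysis needs.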

In [11]:
# Add your code below
# motivation_clean = ...

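# Build the stopword set once rather than on every iteration
english_stopwords = set(stopwords.words('english'))

motivation_clean = [word.lower() for word in motivation_words
                    if word.isalpha()
                    and word.lower() not in english_stopwords]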

motivation_clean


['austin',
 'resident',
 'years',
 'family',
 'family',
 'racially',
 'mixed',
 'son',
 'stopped',
 'lot',
 'philadelphia',
 'long',
 'term',
 'austin',
 'resident',
 'ut',
 'student',
 'works',
 'councilmember',
 'delia',
 'garza',
 'learn',
 'gain',
 'new',
 'perspective',
 'social',
 'work',
 'experience',
 'take',
 'back',
 'commision',
 'better',
 'understand',
 'help',
 'contribute',
 'discussion',
 'work',
 'solutions',
 'self',
 'education',
 'share',
 'outcomes',
 'consulting',
 'power',
 'dynamics',
 'social',
 'systems',
 'public',
 'policy',
 'researcher',
 'grad',
 'student',
 'ut',
 'studies',
 'community',
 'police',
 'relations',
 'stuff',
 'equity',
 'office',
 'race',
 'equity',
 'journey',
 'curious',
 'hear',
 'community',
 'input',
 'apd',
 'commander',
 'years',
 'opportunity',
 'hear',
 'community',
 'achieve',
 'city',
 'goals',
 'lived',
 'austin',
 'since',
 'saw',
 'disparities',
 'community',
 'engagement',
 'community',
 'engagement',
 'affordable',
 'housi

Next, we create the DataFrame `top_50_df`, with two columns:

- `Word`, containing the 50 most common words in `motivation_clean`
- `Count`, containing the number of occurrences of the given word in `motivation_clean`

*These values can be found in the list of tuples created when using `nltk.FreqDist()` and the `.most_common()` method.*

In [12]:
freqdist_clean = nltk.FreqDist(motivation_clean).most_common(50)
top_50_df = pd.DataFrame(freqdist_clean, columns=['Word', 'Count'])
top_50_df

Unnamed: 0,Word,Count
0,austin,8
1,police,8
2,community,7
3,experience,6
4,resident,5
5,years,4
6,perspective,4
7,solutions,4
8,people,4
9,issues,4


Consider the resulting DataFrame, and whether `lemmatisation` or `stemming` might be appropriate (or some other form of grouping the words).

There are no definite answers - we might think that 'officer' and 'officers' could be grouped, but perhaps these could also be grouped with 'police' and even 'apd' (Austin Police Department).

We might consider grouping by synonyms, but it may be more appropriate to group manually, or not at all.
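
As one possible way to group manually, the sketch below maps variants onto a single label before counting. The token list and the `manual_groups` mapping are hypothetical and chosen only to illustrate the idea; `collections.Counter` is used here to keep the example dependency-free, and `nltk.FreqDist` could be used in its place.

```python
from collections import Counter

# Hypothetical cleaned tokens; the real input would be motivation_clean
tokens = ['police', 'officer', 'officers', 'apd', 'community', 'police']

# Hand-written mapping from a variant to the label it should be counted under.
# These groupings are an illustrative assumption, not a recommendation.
manual_groups = {
    'officer': 'police',
    'officers': 'police',
    'apd': 'police',
}

# Replace each token with its group label, leaving unmapped tokens unchanged
grouped = [manual_groups.get(token, token) for token in tokens]

counts = Counter(grouped)
print(counts.most_common())
```

A hand-written mapping keeps every grouping decision visible and easy to revise, at the cost of having to maintain it as the vocabulary grows.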

### Part 2: Sentiment Analysis

Add a column to `df` called `Response_Sentiment`, containing the `.sentiment.polarity` value of a `TextBlob` object created from the text of each entry in the `Response` column:

    The polarity score is a float within the range [-1.0, 1.0]
    -1.0 defines a negative sentiment and 1.0 defines a positive sentiment

In [None]:
df['Response_Sentiment'] = df['Response'].apply(lambda x: TextBlob(x).sentiment.polarity)
df[['Response', 'Response_Sentiment']]

We can then use the `.describe()` method on the Series of values:

In [None]:
df['Response_Sentiment'].describe()
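
Raw polarity values can be easier to read once they are bucketed into coarse labels. The sketch below is one way to do that; the function name `label_polarity`, the threshold of 0.1, and the example scores are all assumptions made for illustration, not part of TextBlob.

```python
def label_polarity(score, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The default threshold of 0.1 is an arbitrary assumption for illustration.
    """
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical polarity values, not taken from the data set
example_scores = [0.5, -0.4, 0.0, 0.05]
labels = [label_polarity(score) for score in example_scores]
print(labels)
```

In this workbook, the same function could be applied with `df['Response_Sentiment'].apply(label_polarity)`.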

Run the following code cell to assign the question text to the variable `feedback_question`:

In [None]:
feedback_question = "How do you feel about what you have learned about the Racial Profiling Report so far? \
What came up for you reading the data or listening to the panel?"

**Q4)** Assign to the variable `feedback` a DataFrame containing the entries in `df` where `Question` equals the `feedback_question` given above:

In [None]:
# Add your code below
# feedback = ...

feedback = df[df['Question'] == feedback_question]
feedback.head()


We can see that there are three different values in the `Topic` column, with a fairly even distribution between them:

In [None]:
feedback['Topic'].value_counts()

Create a DataFrame called `topic_sentiment`, using `.groupby()` on `feedback`, which shows the `mean`, `min`, and `max` `Response_Sentiment` values for each `Topic`:

*You may find the pandas `.agg(['mean', 'min', 'max'])` method useful.*

In [None]:
topic_sentiment = feedback.groupby('Topic')['Response_Sentiment'].agg(['mean', 'min', 'max'])
topic_sentiment
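
To see the shape of the result on something small enough to check by hand, here is a sketch on a hypothetical four-row DataFrame. The topic labels and sentiment values are invented, and the extra `'count'` aggregation is an addition not used in the workbook, included because a mean is hard to interpret without knowing how many responses it covers.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the workbook
toy = pd.DataFrame({
    'Topic': ['A', 'A', 'B', 'B'],
    'Response_Sentiment': [0.2, -0.4, 0.5, 0.1],
})

# One row per Topic, one column per aggregation
summary = toy.groupby('Topic')['Response_Sentiment'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is indexed by `Topic`, so individual values can be read with, for example, `summary.loc['A', 'mean']`.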

We would like to see if there was much variation in sentiment between the five different groups which participated.

Create a DataFrame called `group_sentiment` (adapting the code above) to find the same `Response_Sentiment` metrics, this time for every `Response` in `df`, grouped by `Group`:

In [None]:
group_sentiment = df.groupby('Group')['Response_Sentiment'].agg(['mean', 'min', 'max'])
group_sentiment