## Using minet to download tweets matching our selection criteria

Minet is a Swiss army knife for web mining, developed at the médialab of Sciences Po: https://github.com/medialab/minet


In [None]:
! pip install minet

Collecting minet
  Downloading minet-0.55.9-py3-none-any.whl (168 kB)
Collecting colorama>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting json5>=0.8.5
  Downloading json5-0.9.6-py2.py3-none-any.whl (18 kB)
Collecting browser-cookie3==0.13.0
  Downloading browser-cookie3-0.13.0.tar.gz (9.4 kB)
Collecting dateparser>=1.0.0
  Downloading dateparser-1.1.0-py2.py3-none-any.whl (288 kB)
Collecting tenacity>=7.0.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Collecting persist-queue>=0.7.0
  Downloading persist_queue-0.7.0-py2.py3-none-any.whl (32 kB)
Collecting ndjson>=0.3.1
  Downloading ndjson-0.3.1-py2.py3-none-any.whl (5.3 kB)
Collecting lxml>=4.3.0
  Downloading lxml-4.6.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.3 MB)

In [None]:
! minet twitter scrape --help

usage: minet twitter scrape
       [-h]
       [--include-refs]
       [-l LIMIT]
       [-o OUTPUT]
       [--query-template QUERY_TEMPLATE]
       [-s SELECT]
       {tweets}
       query
       [file]

Minet Twitter Scrape Command

Scrape Twitter's public facing search API to collect tweets etc.

Be sure to check Twitter's advanced search to check what kind of
operators you can use to tune your queries (time range, hashtags,
mentions, boolean etc.):
https://twitter.com/search-advanced?f=live

Useful operators include "since" and "until" to search specific
time ranges like so: "since:2014-01-01 until:2017-12-31".

positional arguments:
  {tweets}
    What to scrape. Currently only `tweets` is possible.
  query
    Search query or name of the column containing queries to run in given CSV file.
  file
    Optional CSV file containing the queries to be run.

optional arguments:
  -h, --help
    show this help message and exit
  --include-refs
    Whether to emit referenced tweets (quote



### Minet lets us search and filter tweets using Twitter's advanced search criteria

These include:

1. A list of hashtags, at least one of which has to be included in the tweet
2. The language the tweet is written in
3. Whether the tweet contains external links or images
4. Whether it is an original tweet or a reply to another tweet

We also applied a limit of 25000 tweets per query so that the data collection would not overwhelm our analytical capabilities.
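
The query strings passed to minet in the cells below can be assembled from these criteria. The following is a minimal sketch: `build_query` and its parameter names are hypothetical and not part of minet, and it only uses the operators that appear in our own queries (`OR`, `lang:`, `-filter:links`, `-filter:replies`).

```python
def build_query(hashtags, lang=None, exclude_links=True, exclude_replies=True):
    """Assemble a Twitter advanced-search query string (hypothetical helper)."""
    # At least one of the hashtags has to appear in the tweet.
    parts = ["(" + " OR ".join("#" + tag.lstrip("#") for tag in hashtags) + ")"]
    if lang:
        parts.append(f"lang:{lang}")
    if exclude_links:
        parts.append("-filter:links")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)


lgbt_query = build_query(["lgbt", "gay", "pride", "lesbian", "queer"], lang="en")
print(lgbt_query)
```

Keeping the hashtag lists as Python lists makes it easier to spot a criterion that was left out of one of the queries.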





In [None]:
! minet twitter scrape tweets "(#lgbt OR #gay OR #pride OR #lesbian OR #queer) lang:en -filter:links -filter:replies"  --limit 25000 > LGBT.csv


In [None]:
! minet twitter scrape tweets "(#blacklivesmatter OR #problack OR #blackandproud OR #blackpride OR #blackempowerment) -filter:links -filter:replies"  --limit 25000 > Racial.csv


In [None]:
! minet twitter scrape tweets "(#effyourbeautystandards OR #bodypositive OR #celebratemysize OR #honormycurves OR #allbodiesarebeautiful) lang:en -filter:links -filter:replies"  --limit 25000 > Bodypositiv.csv


In [None]:
! minet twitter scrape tweets "(#feminism OR #womenrights OR #womenempowerment OR #mydressmychoice OR #ShoutYourAbortion) lang:en -filter:links -filter:replies"  --limit 25000 > Women.csv
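
After scraping, it can be worth checking how many tweets each file actually contains, since a query may return fewer than the 25000 requested. This is a minimal sketch using only the standard library; `count_rows` is a hypothetical helper, and it assumes that each file starts with a header row.

```python
import csv
from pathlib import Path


def count_rows(path):
    """Return the number of data rows in a CSV file, excluding the header."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


for name in ["LGBT.csv", "Racial.csv", "Bodypositiv.csv", "Women.csv"]:
    if Path(name).exists():  # skip files that have not been scraped yet
        print(name, count_rows(name))
```

The `csv` module is used instead of counting lines because tweet text can contain line breaks inside a quoted field.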


# Scattertext for word visualisation

Scattertext produces interactive HTML plots that show which terms are most characteristic of each category in a text corpus.
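
As a rough illustration of what a scattertext plot encodes: each term is placed according to how often it occurs in each of the two categories. The sketch below counts term frequencies per category with the standard library only. The tiny example corpus and the `term_counts_by_category` helper are made up for illustration, and scattertext's own tokenisation and scaling are more elaborate.

```python
from collections import Counter, defaultdict


def term_counts_by_category(docs):
    """Count lowercase, whitespace-separated terms per category.

    `docs` is an iterable of (category, text) pairs.
    """
    counts = defaultdict(Counter)
    for category, text in docs:
        counts[category].update(text.lower().split())
    return counts


# Made-up example documents, not taken from our dataset.
docs = [
    ("toxic", "you are awful"),
    ("non-toxic", "you are lovely"),
    ("non-toxic", "lovely day"),
]
counts = term_counts_by_category(docs)
for term in sorted(set(counts["toxic"]) | set(counts["non-toxic"])):
    print(term, counts["toxic"][term], counts["non-toxic"][term])
```

Terms with a high count in one category and a low count in the other are the ones that end up in opposite corners of the plot.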

The following code was used to create the .html files of the scattertext plots displayed on the GitHub page and in the repository.

In [None]:
!pip install scattertext
!pip install spacy

Collecting scattertext
  Downloading scattertext-0.1.5-py3-none-any.whl (7.3 MB)
Collecting gensim>=4.0.0
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting mock
  Downloading mock-4.0.3-py3-none-any.whl (28 kB)
Collecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9310 sha256=52ebc336ebb32d37ac177363b53000b0b285d91e1470ff66568d426ee2565cb0
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: mock, gensim, flashtext, scattertext
  Attempting uninstall: gensim
    Found existing installation: gens

In [None]:
import scattertext as st
import pandas as pd
import spacy


In [None]:
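# Load the cleaned, labelled tweets; "Dummy" holds the toxicity label (0/1).
df = pd.read_excel("/content/Final_LGBTQ_clean.xlsx")

# Map the numeric labels to category names and make sure the text is a string.
# (The original astype calls were never assigned, so they had no effect.)
df["Dummy"] = df["Dummy"].replace({0: "non-toxic", 1: "toxic"})
df["text"] = df["text"].astype(str)

# Build the scattertext corpus, tokenising with spaCy's small English model.
nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(
    df, category_col="Dummy", text_col="text", nlp=nlp
).build()

# Render the interactive explorer as an HTML string.
html = st.produce_scattertext_explorer(
    corpus,
    category="toxic",
    category_name="toxic",
    not_category_name="non-toxic",
    width_in_pixels=1000,
)

# Write the plot to disk; the cell displays the number of characters written.
with open("./toxic_LGBTQ_FINAL.html", "w", encoding="utf-8") as f:
    n_chars = f.write(html)
n_chars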



6878516

In [None]:
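# Load the cleaned, labelled tweets; "Dummy" holds the toxicity label (0/1).
df = pd.read_excel("/content/final_Women_clean.xlsx")

# Map the numeric labels to category names and make sure the text is a string.
# (The original astype calls were never assigned, so they had no effect.)
df["Dummy"] = df["Dummy"].replace({0: "non-toxic", 1: "toxic"})
df["text"] = df["text"].astype(str)

# Build the scattertext corpus, tokenising with spaCy's small English model.
nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(
    df, category_col="Dummy", text_col="text", nlp=nlp
).build()

# Render the interactive explorer as an HTML string.
html = st.produce_scattertext_explorer(
    corpus,
    category="toxic",
    category_name="toxic",
    not_category_name="non-toxic",
    width_in_pixels=1000,
)

# Write the plot to disk; the cell displays the number of characters written.
with open("./toxic_Women_FINAL.html", "w", encoding="utf-8") as f:
    n_chars = f.write(html)
n_chars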



7001937

In [None]:
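# Load the cleaned, labelled tweets; "Dummy" holds the toxicity label (0/1).
df = pd.read_excel("/content/final_bodypositive_clean.xlsx")

# Map the numeric labels to category names and make sure the text is a string.
# (The original astype calls were never assigned, so they had no effect.)
df["Dummy"] = df["Dummy"].replace({0: "non-toxic", 1: "toxic"})
df["text"] = df["text"].astype(str)

# Build the scattertext corpus, tokenising with spaCy's small English model.
nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(
    df, category_col="Dummy", text_col="text", nlp=nlp
).build()

# Render the interactive explorer as an HTML string.
html = st.produce_scattertext_explorer(
    corpus,
    category="toxic",
    category_name="toxic",
    not_category_name="non-toxic",
    width_in_pixels=1000,
)

# Write the plot to disk; the cell displays the number of characters written.
with open("./toxic_Bodypositive_FINAL.html", "w", encoding="utf-8") as f:
    n_chars = f.write(html)
n_chars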



4904475

In [None]:
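# Load the cleaned, labelled tweets; "Dummy" holds the toxicity label (0/1).
df = pd.read_excel("/content/final_racial_clean.xlsx")

# Map the numeric labels to category names and make sure the text is a string.
# (The original astype calls were never assigned, so they had no effect.)
df["Dummy"] = df["Dummy"].replace({0: "non-toxic", 1: "toxic"})
df["text"] = df["text"].astype(str)

# Build the scattertext corpus, tokenising with spaCy's small English model.
nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(
    df, category_col="Dummy", text_col="text", nlp=nlp
).build()

# Render the interactive explorer as an HTML string.
html = st.produce_scattertext_explorer(
    corpus,
    category="toxic",
    category_name="toxic",
    not_category_name="non-toxic",
    width_in_pixels=1000,
)

# Write the plot to disk; the cell displays the number of characters written.
with open("./toxic_Racial_FINAL.html", "w", encoding="utf-8") as f:
    n_chars = f.write(html)
n_chars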

5199551