# Lab 1: Getting and Exploring Data with Minet and Python Pandas

What we will do:

1. Explain this programming environment
2. Scrape some Tweets based on a keyword search using the *minet* package
3. Use the pandas package to explore the data and generate some descriptive statistics and visualisations (unfortunately no networks today)
4. Learn some Python and command-line principles on the way (if you didn't know them before)

There will be two versions of this so-called Jupyter Notebook for you to follow along:

* One that is already filled out, in case you would rather concentrate on things other than typing, or want to alter the code to try new things.
* Another with the code 'cells' emptied, so you can practice typing Python alongside the lecturer (or maybe even find better solutions to the given problems).

But now let's start.

## Get to know the minet package

Let's check whether minet is correctly set up in this programming environment.

You can always look up instructions on how to use it in its [documentation](https://github.com/medialab/minet/blob/master/docs/cli.md).

The output of this cell should be something like `minet 0.67.1`

In [None]:
!minet --version

Let's call for help.

In [None]:
!minet --help

We actually want Twitter data, so let's try that.

In [None]:
!minet twitter

We are not sure whether the API is still working, so we choose scraping instead.

In [None]:
!minet twitter scrape -h

We're interested in discussions about Germany giving battle tanks to Ukraine. So, just to test our query, let's scrape 10 tweets containing the word `Leopard` (the name of the German tank model most requested by Ukraine).

In [None]:
!minet twitter scrape tweets -l 10 "Leopard"

Guess we have to refine the query …

In [None]:
!minet twitter scrape tweets -l 10 "(ukraine Germany) AND (tank OR tanks OR leopard)"

Meh, still not good enough?

In [None]:
!minet twitter scrape tweets -l 10 "(Ukraine Germany) AND (tank OR tanks OR leopard) AND (deliver OR delivery OR delivers)"
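
The query above has a regular shape: one group of words that must all occur, followed by groups of alternatives chained with `AND`. As a minimal sketch, it could also be assembled in Python. The helper names `or_group` and `build_query` are made up for this illustration and are not part of minet, and the sketch assumes that Twitter search treats space-separated words as an implicit AND.

```python
def or_group(terms):
    """Join alternative terms into one parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"


def build_query(required_words, *alternatives):
    """Hypothetical helper: required words first, then AND-chained OR groups."""
    # Assumption: words separated by a space must all occur in a tweet.
    parts = ["(" + " ".join(required_words) + ")"]
    parts += [or_group(terms) for terms in alternatives]
    return " AND ".join(parts)


query = build_query(
    ["Ukraine", "Germany"],
    ["tank", "tanks", "leopard"],
    ["deliver", "delivery", "delivers"],
)
print(query)
```

Printing the string before passing it to minet makes it easy to spot a missing bracket or a misspelled operator.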

OK, this looks better. But we want more tweets, and the output will be too much to view here. So let's write it to a CSV file called `leo_tweets.csv`.

In [None]:
!minet twitter scrape tweets -l 10 "(Ukraine Germany) AND (tank OR tanks OR leopard) AND (deliver OR delivery OR delivers)" -o leo_tweets.csv

Now open the CSV file on the left to check whether everything looks OK.

Then come back and we'll collect tweets since the beginning of this year.

(And go for a coffee in the meantime. Should take about 3 minutes.)

In [None]:
!minet twitter scrape tweets "(Ukraine Germany) AND (tank OR tanks OR leopard) AND (deliver OR delivery OR delivers) since:2023-01-01" -o leo_tweets.csv

For the remainder of this tutorial we will use Pandas. Pandas is basically a Swiss Army knife for data wrangling and analysis in Python. Think of it as R, but in Python.

You can always look up its documentation [here](https://pandas.pydata.org/docs/user_guide/index.html).

First we need to import the package with `import pandas as pd`.

## Explore the Data

In [None]:
import pandas as pd

Then we read in the data with `pd.read_csv`. You can always get help in Jupyter by writing a question mark after a command and running the cell. Also, try using the TAB key to trigger autocompletion!

In [None]:
pd.read_csv('leo_tweets.csv')

That's a nice display of the data. But, actually, we want to store it in a variable. Let's call it `df`, short for dataframe.

In [None]:
df = pd.read_csv('leo_tweets.csv')

In [None]:
df
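
Besides displaying the whole dataframe, a few attributes and methods give a quick overview. The sketch below uses a tiny made-up table instead of `leo_tweets.csv`, so the values are purely illustrative. The column names are borrowed from the cells further down.

```python
import pandas as pd

# Made-up stand-in for the scraped tweets.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "user_screen_name": ["anna", "ben", "anna"],
    "text": ["first tweet", "second tweet", "third tweet"],
})

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # the column names as a plain list
print(toy.head(2))           # only the first two rows
```

The same calls work on `df` and are handy when the full table is too wide to read comfortably.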

Let's parse the dates with the help of the documentation of the `read_csv` function.

In [None]:
df = pd.read_csv('leo_tweets.csv', parse_dates=['local_time'])

In [None]:
df
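
To see what `parse_dates` changes, here is a small sketch that reads the same made-up CSV text twice, once without and once with date parsing. The two rows are invented for illustration. Only the `local_time` column name is taken from the cell above.

```python
import io

import pandas as pd

# Two invented rows in CSV form, read from memory instead of from a file.
csv_text = (
    "id,local_time,text\n"
    "1,2023-01-25T10:15:00,first\n"
    "2,2023-01-26T08:00:00,second\n"
)

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["local_time"])

# Without parse_dates the column holds text, with it real timestamps.
print(plain["local_time"].dtype)
print(parsed["local_time"].dtype)

# Only a datetime column offers the .dt accessor used later on.
print(parsed["local_time"].dt.day.tolist())
```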

Let's see who tweeted the most with the `groupby` and `count` methods.

In [None]:
df.groupby('user_screen_name')['id'].count().sort_values(ascending=False)
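
The pattern is easier to follow on a handful of rows. The sketch below uses invented data and also shows `value_counts`, a pandas shortcut that yields the same counts here because every row has an `id`.

```python
import pandas as pd

# Invented rows: anna tweets three times, ben and cara once each.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user_screen_name": ["anna", "ben", "anna", "cara", "anna"],
})

# Group rows by user, count the ids in each group, largest count first.
per_user = (
    toy.groupby("user_screen_name")["id"]
    .count()
    .sort_values(ascending=False)
)
print(per_user)

# Shortcut: count how often each value occurs in one column.
shortcut = toy["user_screen_name"].value_counts()
print(shortcut)
```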

And let's make a nice bar plot of the top 30 with the `plot` method.

In [None]:
df.groupby('user_screen_name')['id'].count().sort_values(ascending=False)[:30].plot(kind='bar')

Let's look at their user descriptions.

In [None]:
top_30 = df.groupby('user_screen_name')['id'].count().sort_values(ascending=False)[:30]

top_30_with_descriptions = pd.merge(top_30, df, left_index=True, right_on='user_screen_name')[['user_screen_name', 'user_description']]

top_30_with_descriptions.drop_duplicates()
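
The merge step can be hard to picture, so here is a sketch of one possible variant on invented data. It names the count column `n_tweets` (a name chosen only for this example) and removes duplicate profiles before merging, which is a slightly different route than the cell above.

```python
import pandas as pd

# Invented tweets with the user description repeated on every row.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user_screen_name": ["anna", "ben", "anna", "anna"],
    "user_description": ["Reporter", "Analyst", "Reporter", "Reporter"],
})

# One row per user with the number of tweets.
counts = (
    toy.groupby("user_screen_name")["id"]
    .count()
    .rename("n_tweets")
    .reset_index()
)

# One row per user with the description.
profiles = toy[["user_screen_name", "user_description"]].drop_duplicates()

# Join both tables on the shared user name column.
result = counts.merge(profiles, on="user_screen_name").sort_values(
    "n_tweets", ascending=False
)
print(result)
```

Keeping the counts in the result makes it possible to read activity and description side by side.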

Let's look at the tweets of the most active account with 'boolean filtering'.

In [None]:
top_user = top_30_with_descriptions['user_screen_name'].iloc[0]

# print(top_user)

df[df['user_screen_name'] == top_user][['local_time','user_screen_name', 'text']]
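
Boolean filtering works in two steps: a comparison produces one `True` or `False` per row, and that mask then selects the rows. A minimal sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_screen_name": ["anna", "ben", "anna"],
    "text": ["first tweet", "second tweet", "third tweet"],
})

# Step 1: the comparison yields a boolean Series, one entry per row.
mask = toy["user_screen_name"] == "anna"
print(mask.tolist())

# Step 2: the mask picks the rows, the list picks the columns.
annas_tweets = toy.loc[mask, ["user_screen_name", "text"]]
print(annas_tweets)
```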

Let's now look at tweets over time.

In [None]:
df.groupby(df["local_time"].dt.date)['id'].count().plot(kind="bar", figsize=(15,5))
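
Grouping by `dt.date` only returns days that actually contain tweets. The sketch below, on invented timestamps, shows this and contrasts it with `resample`, an alternative that also lists empty days with a count of zero.

```python
import datetime

import pandas as pd

# Invented timestamps with no tweet on 25 January.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "local_time": pd.to_datetime(
        ["2023-01-24 09:00", "2023-01-26 11:30", "2023-01-26 18:45"]
    ),
})

# As in the cell above: one group per calendar day that occurs in the data.
per_day = toy.groupby(toy["local_time"].dt.date)["id"].count()
print(per_day)

# Alternative: daily bins over the whole period, empty days included.
daily = toy.set_index("local_time")["id"].resample("D").count()
print(daily)
```

For a bar plot over time, the second variant keeps the spacing of the days honest, since gaps stay visible.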

There was a lot of activity on certain days. Let's look closer with 'boolean filtering'.

In [None]:
# Show up to 1000 characters per cell so the tweet text is not cut off.
pd.set_option('display.max_colwidth', 1000)

df[(df['local_time'] >= '2023-01-25') & (df['local_time'] < '2023-01-26')][['text']]
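
Two conditions are combined with `&`, and each one needs its own brackets because `&` binds more tightly than `>=` and `<` in Python. The sketch below uses invented timestamps and explicit `pd.Timestamp` bounds. The lower bound is included, the upper bound is not.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "local_time": pd.to_datetime([
        "2023-01-24 23:59",
        "2023-01-25 00:00",
        "2023-01-25 17:30",
        "2023-01-26 00:00",
    ]),
})

start = pd.Timestamp("2023-01-25")
end = pd.Timestamp("2023-01-26")

# Each comparison gives a boolean Series, & combines them row by row.
in_range = (toy["local_time"] >= start) & (toy["local_time"] < end)
selected = toy[in_range]
print(selected)
```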

And now, to have at least some kind of network analysis, let's look at who got the most mentions.

In [None]:
mentioned_names = df[['mentioned_names']]

expanded = mentioned_names['mentioned_names'].str.split('|', expand=True)

expanded

In [None]:
# Start with an empty integer Series; an explicit dtype avoids a pandas warning.
counts = pd.Series(dtype='int64')

for column in expanded.columns:
    counts_new = expanded.groupby(column)[column].count()
    # print(counts_new)
    counts = pd.concat([counts, counts_new])

# print(counts)

most_mentioned = counts.groupby(counts.index).sum().sort_values(ascending=False)

most_mentioned
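
The loop above counts each expanded column separately and then adds the results up. As an alternative sketch, not the approach used in this lab, `explode` can turn every mention into its own row so that a single `value_counts` call suffices. The data is invented and assumes the same `|` separator.

```python
import pandas as pd

# Invented mentions: several names are joined with "|", some tweets have none.
toy = pd.DataFrame({
    "mentioned_names": ["anna|ben", None, "ben", "ben|cara|anna"],
})

mentions = (
    toy["mentioned_names"]
    .dropna()          # tweets without mentions
    .str.split("|")    # "anna|ben" becomes ["anna", "ben"]
    .explode()         # one row per mentioned name
)

toy_most_mentioned = mentions.value_counts()
print(toy_most_mentioned)
```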

In [None]:
most_mentioned[100::-1].plot(kind='barh', figsize=(5,15))
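
The slice `[100::-1]` starts at position 100 and walks backwards to position 0, so it selects the top 101 accounts in reverse order. A horizontal bar chart draws the first row at the bottom, which is why the reversal puts the most mentioned account at the top. A small sketch of the same slicing on invented counts, written with `iloc` to make the positional meaning explicit:

```python
import pandas as pd

# Invented counts, already sorted from most to least mentioned.
ranked = pd.Series([5, 3, 2, 1], index=["anna", "ben", "cara", "dan"])

# Start at position 2 and step backwards to position 0.
top_three_reversed = ranked.iloc[2::-1]
print(top_three_reversed)
```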

# Thanks for your attention! Any Questions?

Ask now, or contact @flxvctr(@mas.to) on Twitter or Mastodon.