## Flüchtlingskrise Sentiment Analysis
### Emily Martin, eem80@pitt.edu

In [4]:
# Importing necessary libraries
import pandas as pd
import pickle
import nltk

## The data
#### Shape and acquisition
- Using the four scripts in my repo ([Süddeutsche_zeitung](https://github.com/Data-Science-for-Linguists-2021/Fluechtlingskrise-Sentiment-Analysis/blob/main/Süddeutsche_zeitung.ipynb), [taz](https://github.com/Data-Science-for-Linguists-2021/Fluechtlingskrise-Sentiment-Analysis/blob/main/taz.ipynb), [zeit](https://github.com/Data-Science-for-Linguists-2021/Fluechtlingskrise-Sentiment-Analysis/blob/main/zeit.ipynb) and [Junge Freiheit](https://github.com/Data-Science-for-Linguists-2021/Fluechtlingskrise-Sentiment-Analysis/blob/main/Junge%20Freiheit.ipynb)), I scraped each site for news articles from 2015 that match the search terms 'Flüchtling' (refugee) and/or 'Migranten' (migrants).
- The number of articles varies widely by site, because some sites were easier to scrape than others and because the newspapers differ in size. For Die TAZ there are 100 articles from manually compiled links, for Die Zeit there are 573 from links collected through its API, for the Süddeutsche Zeitung there are 982 (out of 1,000 scraped links, 18 of which were broken), and for Junge Freiheit there are 60.
- After collecting the articles in the scripts, I stored them in dataframes, which I then pickled. However, the pickled files are not available through my repo for copyright reasons.
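
The last step above, turning scraped articles into a dataframe and pickling it, might look roughly like the following. This is a minimal sketch, not the code from the scraping scripts: the `records` list, its contents, the whitespace-based word count, and the file name `example_df.pkl` are all made up for illustration. Only the column names follow the dataframes in this notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical scraped records; the real ones come from the scraping scripts.
records = [
    {"href": "https://example.org/a", "text": "Erster Beispielartikel über Flüchtlinge.", "date": "2015-01-01"},
    {"href": "https://example.org/b", "text": "Zweiter Beispielartikel.", "date": "2015-01-02"},
]

example_df = pd.DataFrame(records)

# Whitespace-based word count (an assumption; the scripts may count differently).
example_df["word_count"] = example_df["text"].str.split().str.len()

# Pickle the dataframe into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_df.pkl")
    example_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

print(restored_df.shape)
```

Reading the file back with `pd.read_pickle` is the same call used to load each source below.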

### A quick look at each source

#### Die Zeit
- Die Zeit is one of the largest weekly newspapers in Germany. It is centrist/liberal in its political leanings and kindly provides an API.

In [18]:
# Unpickle the dataframes
zeit_df = pd.read_pickle("zeit_df.pkl")

print(zeit_df.shape)
zeit_df.head()

(573, 5)


|  | title | href | text | release_date | word_count |
|---|---|---|---|---|---|
| 0 | Mahmood im Schilderwald | http://www.zeit.de/2015/51/fuehrerschein-fluec... | Als er vor über zehn Jahren Autofahren gelernt... | 2015-12-31T02:51:37Z | 1175 |
| 1 | Zwei zähe Einzelgänger | http://www.zeit.de/2015/51/vorbereitung-auf-da... | Wo Zou Lei herkommt, ist das Leben nicht leich... | 2015-12-31T01:56:01Z | 1125 |
| 2 | Fortsetzung folgt – jetzt | http://www.zeit.de/2016/01/geschichten-2015-fo... | Lok Leipzig ist ratlos, was aus Mario Basler w... | 2015-12-30T09:00:08Z | 313 |
| 3 | Anhaltend hohe Flüchtlingszahlen auf Balkanroute | http://www.zeit.de/gesellschaft/2015-12/slowen... | Auch zum Jahresende kommen weiter täglich Taus... | 2015-12-29T22:14:02Z | 362 |
| 4 | Laut Özoğuz schürt Union Vorurteile gegen Flüc... | http://www.zeit.de/politik/deutschland/2015-12... | Opposition und Koalitionspartner kritisieren d... | 2015-12-29T08:24:55Z | 379 |


In [16]:
# A little about this dataframe:
zeit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573 entries, 0 to 572
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         573 non-null    object
 1   href          573 non-null    object
 2   text          573 non-null    object
 3   release_date  573 non-null    object
 4   word_count    573 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 22.5+ KB


In [31]:
zeit_df.describe()
# word_count has a min of 1: at least one article came from a broken link

|  | word_count |
|---|---|
| count | 573.0 |
| mean | 562.750436 |
| std | 237.382328 |
| min | 1.0 |
| 25% | 384.0 |
| 50% | 551.0 |
| 75% | 697.0 |
| max | 1945.0 |
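
A minimum `word_count` of 1 points to at least one near-empty row. One way to locate and set aside such rows is a simple word-count threshold. The sketch below runs on toy data. The cutoff `MIN_WORDS = 50` and the name `toy_df` are assumptions for illustration, not values used in this project.

```python
import pandas as pd

MIN_WORDS = 50  # assumed cutoff; pick a value that suits the corpus

# Toy stand-in for a dataframe like zeit_df.
toy_df = pd.DataFrame({
    "title": ["ok article", "broken link", "another ok article"],
    "word_count": [362, 1, 1175],
})

# Rows that are too short to be real articles.
too_short = toy_df[toy_df["word_count"] < MIN_WORDS]

# Rows that are kept, with a fresh 0..n-1 index.
usable = toy_df[toy_df["word_count"] >= MIN_WORDS].reset_index(drop=True)

print(len(too_short), "row(s) flagged;", len(usable), "kept")
```

On the real data the same filter would read `zeit_df[zeit_df["word_count"] < MIN_WORDS]`.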


#### Die TAZ
- Die TAZ (Die Tageszeitung) is a daily German newspaper with a modest circulation. It leans left-wing/green and is the furthest left of the four sources.

In [19]:
# Unpickle and a quick look at the dataframe
taz_df = pd.read_pickle("taz_df.pkl")
print(taz_df.shape)
taz_df.head()

(100, 4)


|  | href | text | word_count | date |
|---|---|---|---|---|
| 0 | https://taz.de/Essay-Journalismus-und-Zuwander... | Deutschland hat sich verändert. Die Redaktione... | 1232 | 2015-12-31 |
| 1 | https://taz.de/Fluechtlingsdebatte-in-den-USA/... | Nach den Anschlägen von Paris wollen nur noch ... | 786 | 2015-11-17 |
| 2 | https://taz.de/Kommentar-Verfassungsschutz/!50... | Die Reform des V-Leute-Wesens ist eine Charmeo... | 266 | 2015-03-25 |
| 3 | https://taz.de/NPD-Invasion-in-Fluechtlingsunt... | NPD-Landtagsabgeordnete besuchten eine Erstauf... | 561 | 2015-09-28 |
| 4 | https://taz.de/Misshandlung-von-Fluechtlingen-... | Per Referendum will Premier Orbán rechtswidrig... | 471 | 2015-04-28 |


In [20]:
taz_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   href        100 non-null    object
 1   text        100 non-null    object
 2   word_count  100 non-null    int64 
 3   date        100 non-null    object
dtypes: int64(1), object(3)
memory usage: 3.2+ KB


In [21]:
taz_df.describe()
# No broken links/problem areas

|  | word_count |
|---|---|
| count | 100.0 |
| mean | 799.43 |
| std | 458.021332 |
| min | 209.0 |
| 25% | 495.0 |
| 50% | 674.0 |
| 75% | 948.0 |
| max | 2846.0 |


#### Die Süddeutsche Zeitung
- The Süddeutsche Zeitung is a daily newspaper with a very wide circulation (second largest after Die Zeit). It leans left-liberal.

In [22]:
# Unpickle and take a quick look at the data
sz_df = pd.read_pickle("sz_df.pkl")
print(sz_df.shape)
sz_df.head()

(1000, 4)


|  | href | text | word_count | date |
|---|---|---|---|---|
| 0 | https://www.sueddeutsche.de/politik/migration-... | Berlin (dpa) - Die Bundesländer haben für die ... | 89 | 27. Dezember 2015, 2:45 Uhr |
| 1 | https://www.sueddeutsche.de/politik/migration-... | Rom (dpa) - Im Mittelmeer vor Italien sind auc... | 62 | 26. Dezember 2015, 20:51 Uhr |
| 2 | https://www.sueddeutsche.de/kultur/rueckblick-... | 1 / 12 Quelle: 20th Century Fox Südseefilme si... | 1818 | 26. Dezember 2015, 17:57 Uhr |
| 3 | https://www.sueddeutsche.de/politik/rueckblick... | Bei dem Blick zurück auf das Jahr 2015 stechen... | 451 | 26. Dezember 2015, 16:00 Uhr |
| 4 | https://www.sueddeutsche.de/politik/fluechtlin... | Nach einem Brandanschlag auf eine noch nicht f... | 387 | 26. Dezember 2015, 15:43 Uhr |


In [30]:
sz_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   href        1000 non-null   object
 1   text        1000 non-null   object
 2   word_count  1000 non-null   int64 
 3   date        982 non-null    object
dtypes: int64(1), object(3)
memory usage: 31.4+ KB


In [24]:
sz_df.describe()
# There were 18 broken links
# Fairly short articles

|  | word_count |
|---|---|
| count | 1000.0 |
| mean | 368.44 |
| std | 282.050192 |
| min | 1.0 |
| 25% | 119.0 |
| 50% | 331.5 |
| 75% | 532.25 |
| max | 2402.0 |
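
The count of 18 broken links matches the `info()` output above, where `date` has 982 non-null values out of 1000. Assuming that a missing `date` is what marks a broken link, such rows could be counted and dropped as sketched below. The toy dataframe `toy_sz` and its contents are invented for illustration.

```python
import pandas as pd

# Toy stand-in for sz_df: the second row mimics a broken link (no date).
toy_sz = pd.DataFrame({
    "href": ["https://example.org/1", "https://example.org/2", "https://example.org/3"],
    "text": ["Berlin (dpa) - Beispieltext.", "", "Rom (dpa) - Noch ein Beispiel."],
    "date": ["27. Dezember 2015, 2:45 Uhr", None, "26. Dezember 2015, 20:51 Uhr"],
})

# Count rows without a date.
n_missing = int(toy_sz["date"].isna().sum())

# Keep only rows that have a date, with a fresh 0..n-1 index.
cleaned = toy_sz.dropna(subset=["date"]).reset_index(drop=True)

print(n_missing, cleaned.shape)
```

On the real data, `sz_df["date"].isna().sum()` would be expected to give 18 if that assumption holds.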


#### Junge Freiheit
- Junge Freiheit is a small weekly newspaper with fairly strong right-wing leanings.

In [32]:
# Unpickle and a quick look at the data
jf_df = pd.read_pickle("jf_df.pkl")
print(jf_df.shape) # This is the smallest sample
jf_df.head()

(60, 4)


|  | href | text | word_count | date |
|---|---|---|---|---|
| 0 | https://jungefreiheit.de/politik/deutschland/2... | POTSDAM. Brandenburgs AfD-Chef Alexander Gaula... | 385 | 18. November 2015 |
| 1 | https://jungefreiheit.de/debatte/kommentar/201... | Die Norwegerin Linda Hagen ist immer noch ganz... | 171 | 05. November 2015 |
| 2 | https://jungefreiheit.de/politik/deutschland/2... | ERFURT. Asylbewerber, die mit der Deutschen Ba... | 191 | 04. November 2015 |
| 3 | https://jungefreiheit.de/politik/ausland/2015/... | TRIPOLIS. Der libysche „Allgemeine Volkskongre... | 262 | 04. November 2015 |
| 4 | https://jungefreiheit.de/politik/deutschland/2... | Dreizehn Regierungschefs beschließen auf einem... | 729 | 01. November 2015 |


In [33]:
jf_df.info()
# All non-null, which is good. Can't really afford to lose more articles from this source

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   href        60 non-null     object
 1   text        60 non-null     object
 2   word_count  60 non-null     int64 
 3   date        60 non-null     object
dtypes: int64(1), object(3)
memory usage: 2.0+ KB


In [34]:
jf_df.describe()

|  | word_count |
|---|---|
| count | 60.0 |
| mean | 461.666667 |
| std | 323.277373 |
| min | 126.0 |
| 25% | 195.25 |
| 50% | 356.0 |
| 75% | 732.75 |
| max | 1348.0 |
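
The four dataframes store dates in different formats: ISO timestamps (Die Zeit), ISO dates (Die TAZ), and German long-form dates with or without a time of day (Süddeutsche Zeitung, Junge Freiheit). If the dates are needed later, they could be normalized with a small parser. The sketch below is one possible approach and is not part of the original scripts. The helper `parse_article_date` is hypothetical, uses only the standard library, and discards the time of day.

```python
import re
from datetime import date

# German month names as they appear in the Süddeutsche Zeitung and Junge Freiheit dates.
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11, "Dezember": 12,
}

ISO_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")            # 2015-12-31 or 2015-12-31T02:51:37Z
GERMAN_PATTERN = re.compile(r"^(\d{1,2})\.\s+(\w+)\s+(\d{4})")   # 27. Dezember 2015[, 2:45 Uhr]


def parse_article_date(raw):
    """Return a datetime.date for the formats seen in this notebook, else None."""
    if not isinstance(raw, str):
        return None  # e.g. a missing date from a broken link
    raw = raw.strip()

    iso = ISO_PATTERN.match(raw)
    if iso:
        year, month, day = (int(part) for part in iso.groups())
        return date(year, month, day)

    german = GERMAN_PATTERN.match(raw)
    if german and german.group(2) in GERMAN_MONTHS:
        day, month_name, year = german.groups()
        return date(int(year), GERMAN_MONTHS[month_name], int(day))

    return None  # unknown format


for example in ["2015-12-31T02:51:37Z", "2015-12-31", "27. Dezember 2015, 2:45 Uhr", "18. November 2015"]:
    print(example, "->", parse_article_date(example))
```

Applied to a column, this would read for example `sz_df["date"].map(parse_article_date)`.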


## Sentiment Analysis