In this notebook, we perform the following steps:

1. **Load and Match Articles from Factiva**: 
   - We begin by downloading 198 RTF files of Capital articles from Factiva. These RTF files are then converted to TXT format. After the conversion, we load the articles from the TXT files and match them with their metadata.

2. **Load and Match Articles from LexisNexis**: 
   - Similarly, we download 164 RTF files of Capital articles from LexisNexis. These RTF files are converted to TXT format, and the articles are then loaded and matched with their sentiment annotations.

3. **Combine All Articles**: 
   - Finally, we combine all these articles into one dataset and save it as a CSV file.

## Media Tenor Dataset

To match the articles downloaded from Factiva and LexisNexis with their metadata from the Media Tenor dataset, we first need to load the Media Tenor dataset. We only retain articles with non-empty titles, as it is not possible to identify and download articles without titles.

In [1]:
import pandas as pd

# Load the dataset acquired from Media Tenor
sentiment_data = pd.read_csv('Daten_Wirtschaftliche_Lage.csv', encoding='utf-8', sep=';')

# Filter out rows with empty titles, as we cannot identify and download the articles without titles
sentiment_data = sentiment_data[sentiment_data['title'].notnull()]

# Reset the index of the DataFrame
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the DataFrame
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0


The titles in the Media Tenor dataset were manually entered, leading to potential inconsistencies in punctuation and spacing. To address this issue and ensure accurate matching with the titles of the articles we download from databases, we normalize the titles in the dataset.
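
The `Normalize` class comes from a separate module whose source is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a normalization could work: lowercase the title, replace punctuation with spaces, and collapse repeated whitespace. The function name `normalize_title` and the exact rules are assumptions, chosen to be consistent with the `title_clean` values displayed in this notebook. They are not the actual implementation.

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical normalizer; not the actual `Normalize` implementation."""
    # Lowercase so that "SPONSOREN" and "Sponsoren" compare equal
    lowered = title.lower()
    # Replace every character that is neither a word character nor whitespace with a space
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace and trim both ends
    return " ".join(no_punct.split())

# Titles that differ only in spacing around punctuation map to the same key
print(normalize_title("Gut möglich ,") == normalize_title("Gut möglich,"))  # True
```

Because `\w` is Unicode-aware in Python 3, umlauts and "ß" are kept as they are.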

In [2]:
# Import the Normalize class from the normalize module
from normalize import Normalize

# Initialize the Normalize class with the titles from the 'sentiment_data' DataFrame
normalizer = Normalize(sentiment_data.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the sentiment_data DataFrame as a new column 'title_clean'
sentiment_data['title_clean'] = normalized_titles

sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0,koalition
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100,habt bloß keine angst vor china
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100,wir leben in einer zeit der wohlstands halluzi...
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25,teheran ruft
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0,geht es und wirklich so gut wie es uns merkel ...


We keep only the annotated Capital articles on business cycle conditions (topic group `Konjunktur`), as these are the articles we downloaded from the databases.

In [3]:
# Filter the dataset to include only articles from Capital
sentiment_data = sentiment_data[sentiment_data['medium'] == 'Capital']

# Reset the index of the DataFrame and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

# Further filter the dataset to include only articles related to the business cycle conditions (Konjunktur)
sentiment_data = sentiment_data[sentiment_data['topicgroup'] == 'Konjunktur']

# Reset the index of the DataFrame again and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

We filter the Media Tenor dataset to only keep articles where there was agreement between annotators on sentiment. Articles without annotator agreement (i.e., where `sentiment` is `NaN`) are removed.
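
The `sentiment` function is imported from a separate module, and its exact rule is not shown in this notebook. The sketch below is a hypothetical stand-in that assumes a plurality rule over the `negative`, `no_clear_tone`, and `positive` counts, with `NaN` returned when the largest count is tied. The name `plurality_sentiment` and the tie handling are assumptions. The rule is consistent with the rows displayed in this notebook, but the actual function may differ.

```python
import math

def plurality_sentiment(negative: int, no_clear_tone: int, positive: int) -> float:
    """Hypothetical rule: label of the strictly largest count, NaN on a tie."""
    counts = {-1.0: negative, 0.0: no_clear_tone, 1.0: positive}
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    # A tie for the largest count is treated as "no annotator agreement"
    return winners[0] if len(winners) == 1 else math.nan

print(plurality_sentiment(4, 1, 1))  # -1.0
print(plurality_sentiment(1, 0, 1))  # nan
```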

In [4]:
from sentiment import sentiment

# Apply the 'sentiment' function to each row of the DataFrame and create a new 'sentiment' column
sentiment_data['sentiment'] = sentiment_data.apply(sentiment, axis=1)

# Remove articles where there is no annotator agreement (i.e., sentiment is NaN)
sentiment_data = sentiment_data[sentiment_data['sentiment'].notnull()]

# Reset the index of the DataFrame again and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the filtered DataFrame to verify the results
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean,sentiment
0,14.12.2017,201712,Capital,Man kann mit großen Hoffnungen,Konjunktur,0,1,0,1,0,man kann mit großen hoffnungen,0.0
1,14.12.2017,201712,Capital,SPONSOREN MEIDEN FIFA,Konjunktur,1,0,0,1,-100,sponsoren meiden fifa,-1.0
2,14.12.2017,201712,Capital,ZUKUNFT IST VERGANGENHEIT,Konjunktur,0,0,4,4,100,zukunft ist vergangenheit,1.0
3,15.02.2018,201802,Capital,JOES TAUSCHBÖRSE,Konjunktur,0,0,1,1,100,joes tauschbörse,1.0
4,15.02.2018,201802,Capital,Notenbanker sind Gefangene der Inflation,Konjunktur,4,1,1,6,-50,notenbanker sind gefangene der inflation,-1.0


## Load and Match Articles from Factiva

Next, we focus on loading Capital articles downloaded from Factiva and matching them with their metadata. In our first step, we convert the RTF files into TXT format. All the RTF files are stored in `MediaTenor_LexisNexis_Factiva/Capital_Konjunktur_Factiva_rtf`. The converted TXT files are stored in `MediaTenor_LexisNexis_Factiva/Capital_Konjunktur_Factiva_txt`.
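
`convert_rtf_to_txt` is a project-specific helper whose source is not part of this notebook. The sketch below illustrates only the directory-level pattern such a helper might follow: iterate over the `.rtf` files, convert each one, and write a `.txt` file with the same stem. The names `convert_directory` and `rtf_to_plain_text` are hypothetical. The converter is passed in as a parameter rather than implemented here, since a real RTF parser would be needed in practice.

```python
from pathlib import Path
from typing import Callable

def convert_directory(src_dir: str, dst_dir: str,
                      rtf_to_plain_text: Callable[[str], str]) -> int:
    """Convert every .rtf file in src_dir to a .txt file in dst_dir; return the file count."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = 0
    for rtf_file in sorted(src.glob("*.rtf")):
        # errors="replace" keeps the loop running if a file contains unexpected bytes
        raw = rtf_file.read_text(encoding="utf-8", errors="replace")
        (dst / f"{rtf_file.stem}.txt").write_text(rtf_to_plain_text(raw), encoding="utf-8")
        converted += 1
    return converted
```

Returning the number of converted files makes it easy to compare the result against the number of downloaded articles.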

In [5]:
import os

# Import the function for converting RTF to TXT
from convert_rtf_to_txt import convert_rtf_to_txt

# Define paths for Capital RTF and TXT directories
capital_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Capital_Konjunktur_Factiva_rtf')
capital_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Capital_Konjunktur_Factiva_txt')

# Convert RTF files to TXT format for Capital
convert_rtf_to_txt(capital_rtf_path, capital_txt_path)

After the RTF files were converted to TXT format, we made a few changes to the TXT files. Specifically, we corrected the spelling and punctuation of several titles, which is important for matching them with the metadata from the Media Tenor dataset. For example, the title "Gut **möglich ,**" was corrected to "Gut **möglich,**".

Additionally, a few articles were compilations of multiple pieces. In such cases, we manually selected the annotated article from the compilation.

Once the TXT files were ready, we used the function `extract_article_data_capital_factiva` to load the text of the articles along with the journal's name, date of publication, title, and file name into a dictionary called `article_data`.

In [6]:
import extract_article_data_capital_factiva

# Read and extract relevant information from TXT files in Capital directory.
article_data = extract_article_data_capital_factiva.extract_article_data_capital_factiva(capital_txt_path)

We use the `article_data` dictionary to create a DataFrame `capital_factiva` that includes columns for the journal's name, publication date (day, month, and year), article title, text, and file name.

In [7]:
# Create a DataFrame from the collected data
capital_factiva = pd.DataFrame({
    'journal': article_data['journal'],
    'day': article_data['day'],
    'month': article_data['month'],
    'year': article_data['year'],
    'title': article_data['title'],
    'text': article_data['text'],
    'file': article_data['file']
})

capital_factiva.head()

Unnamed: 0,journal,day,month,year,title,text,file
0,Capital,20,Mai,2020,LASS ANGELN GEHEN,Aktien Trotz schlechter Wirtschaftsdaten feier...,Factiva-20200811-1029.txt
1,Capital,20,Mai,2020,""" Frau Wagenknecht , wird das eine Krise des ...","Der Neustart ""Bisher hat der Kapitalismus all ...",Factiva-20200811-1034.txt
2,Capital,20,Mai,2020,""" UNSICHERHEIT IM QUADRAT """,Interview Der US-Starinvestor und Chefberater ...,Factiva-20200811-1035 (1).txt
3,Capital,20,Mai,2020,BENKOS BEBEN,Signa Lange ging es für Warenhauskönig René Be...,Factiva-20200811-1035.txt
4,Capital,20,Mai,2020,WESTERN VON GESTERN,"Die Wirtschaft ist voller Skandale, Fehden und...",Factiva-20200811-1038.txt


To match the full texts of the loaded articles with their sentiment annotations from the Media Tenor dataset, we follow several key steps. First, we create a date in the same format as in the `sentiment_data` DataFrame. Next, we normalize the titles to ensure accurate matching. We also remove any duplicate articles that were mistakenly downloaded twice. After pre-processing, we merge the articles loaded from Factiva with their sentiment annotations from the Media Tenor dataset. We then sort the final DataFrame `data_match_factiva` in chronological order and retain only the relevant columns. Through this process, we successfully matched **198** Capital articles from Factiva with their sentiment annotations.
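
The date key described above can also be built with vectorized string operations. The following is a minimal sketch, assuming that `day`, `month`, and `year` hold the values shown in the preview of `capital_factiva`. The helper name `build_date_key` and the toy DataFrame are illustrative and not part of the original pipeline.

```python
import pandas as pd

# German month names as they appear in the Factiva export
MONTHS = {
    'Januar': '01', 'Februar': '02', 'März': '03', 'April': '04', 'Mai': '05',
    'Juni': '06', 'Juli': '07', 'August': '08', 'September': '09', 'Oktober': '10',
    'November': '11', 'Dezember': '12',
}

def build_date_key(df: pd.DataFrame) -> pd.Series:
    """Return 'DD.MM.YYYY' strings built from the 'day', 'month' and 'year' columns."""
    day = df['day'].astype(str).str.zfill(2)   # '5' -> '05'
    month = df['month'].map(MONTHS)            # 'Mai' -> '05'; unknown names become NaN
    year = df['year'].astype(str)
    return day + '.' + month + '.' + year

toy = pd.DataFrame({'day': ['20', '5'], 'month': ['Mai', 'März'], 'year': ['2020', '2015']})
print(build_date_key(toy).tolist())  # ['20.05.2020', '05.03.2015']
```

One design difference: an unmapped month name yields a missing value for the whole key instead of a string containing `nan`, which makes such rows easier to spot before merging.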

In [8]:
# Create dictionary to transform month name into month number
name_to_number = {
    'Januar': '01', 'Februar': '02', 'März': '03', 'April': '04', 'Mai': '05',
    'Juni': '06', 'Juli': '07', 'August': '08', 'September': '09', 'Oktober': '10',
    'November': '11', 'Dezember': '12'
}

# Transform month names into month numbers
capital_factiva['month_num'] = capital_factiva['month'].map(name_to_number)

# Create dictionary to transform single-digit day numbers
day_transform = {'1': '01', '2': '02', '3': '03', '4': '04', '5': '05', '6': '06', '7': '07', '8': '08', '9': '09'}

# Transform single-digit day numbers into two-digit format
capital_factiva['day'] = capital_factiva['day'].map(lambda d: day_transform.get(d, d))

# Combine day, month, and year into a date string
capital_factiva['date'] = capital_factiva.apply(lambda row: f"{row['day']}.{row['month_num']}.{row['year']}", axis=1)

# Drop duplicated articles, keeping the first occurrence
# We have duplicates because we mistakenly downloaded the same article twice
capital_factiva = capital_factiva.drop_duplicates(['text', 'year', 'month', 'day'], keep='first')

# Reset the index of the DataFrame
capital_factiva = capital_factiva.reset_index(drop=True)

# Initialize the Normalize class with the titles from the 'capital_factiva' DataFrame
normalizer = Normalize(capital_factiva.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the 'capital_factiva' DataFrame as a new column 'title_clean'
capital_factiva['title_clean'] = normalized_titles

# Merge with sentiment_data on title_clean and date
data_match_factiva = pd.merge(sentiment_data, capital_factiva, how='inner', on=['title_clean', 'date'])

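# Rename 'month_num' to 'month'; the merge has already suffixed the overlapping
# 'month' columns as 'month_x' (Media Tenor) and 'month_y' (Factiva month name)
data_match_factiva = data_match_factiva.rename(columns={'month_num': 'month'})

# Rename 'year_y' to 'year' if it exists; 'year' is present only in the Factiva data,
# so the merge leaves it unsuffixed and this rename has no effect here
data_match_factiva = data_match_factiva.rename(columns={'year_y': 'year'})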

# Convert year, month, and day to integers
data_match_factiva['year'] = data_match_factiva['year'].astype(int)
data_match_factiva['month'] = data_match_factiva['month'].astype(int)
data_match_factiva['day'] = data_match_factiva['day'].astype(int)

# Sort the data in chronological order
data_match_factiva = data_match_factiva.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match_factiva = data_match_factiva.reset_index(drop=True)

# Rename the 'title_y' column to 'title' to reflect the title from the Factiva dataset
data_match_factiva = data_match_factiva.rename(columns={'title_y': 'title'})

# Select only the required columns
data_match_factiva = data_match_factiva[['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']]

# Print the number of articles from Factiva
num_factiva_articles = len(data_match_factiva)

print(f"Number of articles from Factiva: {num_factiva_articles}")

# Display the first few rows of the final matched dataset
data_match_factiva.head()

Number of articles from Factiva: 198


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,Capital,18,9,2014,LASST UNS DIE ZUKUNFT BAUEN !,Konjunktur Die Deutschen genießen ihren Aufsch...,-1.0,Factiva-20200811-1348 (2).txt,lasst uns die zukunft bauen
1,Capital,18,9,2014,HINTERTÜR IM FERNEN OSTEN,Unsere Sanktionen gegen Russland werden wirkun...,1.0,Factiva-20200811-1349.txt,hintertür im fernen osten
2,Capital,18,9,2014,"Gute Sparer , schlechte Anleger","Wenn es um Vermögensbildung geht, sind die Deu...",-1.0,Factiva-20200811-1351.txt,gute sparer schlechte anleger
3,Capital,23,10,2014,""" Im Osten muss sich das Unternehmertum erst w...",Deutsche Einheit Kann man die Lehren aus der d...,-1.0,Factiva-20200811-1344 (2).txt,im osten muss sich das unternehmertum erst wie...
4,Capital,23,10,2014,WIRTSCHAFTSKARTE,KONJUNKTUR RUND UM DEN GLOBUS Nach einem guten...,-1.0,Factiva-20200811-1345.txt,wirtschaftskarte
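
An inner merge silently drops annotations for which no article was found. One possible way to list them, sketched here with toy data, is a left merge with `indicator=True`, which adds a `_merge` column recording where each row came from. The DataFrames and values below are illustrative placeholders, not rows from the actual datasets.

```python
import pandas as pd

# Toy stand-ins for sentiment_data and capital_factiva (placeholder values)
annotations = pd.DataFrame({'title_clean': ['title a', 'title b'],
                            'date': ['18.09.2014', '23.10.2014']})
articles = pd.DataFrame({'title_clean': ['title a'],
                         'date': ['18.09.2014'],
                         'file': ['example-1.txt']})

# A left merge keeps every annotation; '_merge' records whether an article was found
audit = pd.merge(annotations, articles, how='left',
                 on=['title_clean', 'date'], indicator=True)

# Annotations without a matching article are marked 'left_only'
unmatched = audit.loc[audit['_merge'] == 'left_only', ['title_clean', 'date']]
print(unmatched)
```

Inspecting the unmatched rows can reveal titles that still differ after normalization or dates that were parsed incorrectly.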


## Load and Match Articles from LexisNexis

In this section, we aim to load articles from Capital that were downloaded from LexisNexis and match them with their sentiment annotations. We begin by converting the RTF files into TXT format. The original RTF files are located in `MediaTenor_LexisNexis_Factiva/Capital_Konjunktur_LexisNexis_rtf`, and the resulting TXT files are stored in `MediaTenor_LexisNexis_Factiva/Capital_Konjunktur_LexisNexis_txt`.

In [9]:
# Define paths for Capital RTF and TXT directories
capital_lexisnexis_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Capital_Konjunktur_LexisNexis_rtf')
capital_lexisnexis_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Capital_Konjunktur_LexisNexis_txt')

# Convert RTF files to TXT format for Capital
convert_rtf_to_txt(capital_lexisnexis_rtf_path, capital_lexisnexis_txt_path)

Once the RTF files were converted to TXT format, we made some adjustments. Specifically, we corrected a few article lead-ins that had punctuation issues. For example, a lead-in beginning with "**Eon/RWERWE-Chef** Peter Terium verteidigt" was corrected to "**Eon/RWE. RWE-Chef** Peter Terium verteidigt" to ensure accurate formatting.

After preparing the TXT files, we used the `extract_article_data_capital_lexisnexis` function to load the articles' text, along with the journal name, publication date, title, and file name, into a dictionary called `article_data_lexisnexis`.

In [10]:
import extract_article_data_capital_lexisnexis

# Read and extract relevant information from TXT files in Capital directory.
article_data_lexisnexis = extract_article_data_capital_lexisnexis.extract_article_data_capital_lexisnexis(capital_lexisnexis_txt_path)

We use the `article_data_lexisnexis` dictionary to create a DataFrame `capital_lexisnexis` that includes columns for the journal's name, publication date (day, month, and year), article title, text, and file name.

In [11]:
# Create a DataFrame from the collected data
capital_lexisnexis = pd.DataFrame({
    'journal': article_data_lexisnexis['journal'],
    'day': article_data_lexisnexis['day'],
    'month': article_data_lexisnexis['month'],
    'year': article_data_lexisnexis['year'],
    'title': article_data_lexisnexis['title'],
    'text': article_data_lexisnexis['text'],
    'file': article_data_lexisnexis['file']
})

capital_lexisnexis.head()

Unnamed: 0,journal,day,month,year,title,text,file
0,Capital,24,October,2013,"""20 Rivalen""","G20. ""Wir verpflichten uns zum automatischen D...",20 RIVALEN. EIN JAHR. EIN SATZ. EINE SENSATION...
1,Capital,18,June,2014,"""Hier geht's ab!""",DYNAMISCHSTE STÄDTE. Die Wirtschaft brummt in ...,23 Hier geht_s ab_.txt
2,Capital,18,June,2014,"""Mister 14 Prozent""",BESTER INVESTOR. Frank Hansen ist Spezialist f...,27 MISTER 14 PROZENT.txt
3,Capital,15,December,2011,"""Abgesoffen""",Computerindustrie. Weltweit explodieren die Pr...,Abgesoffen.txt
4,Capital,19,July,2012,"""Aktionsplan Schiene""",Deutsche Bahn. Der Konzern verdient mit der Be...,Aktionsplan Schiene.txt


To match the full texts of the loaded articles with their sentiment annotations from the Media Tenor dataset, we follow several key steps. First, we create a date in the same format as in the `sentiment_data` DataFrame. Next, we normalize the titles to ensure accurate matching. We also verify that there are no duplicate articles. After pre-processing, we merge the articles loaded from LexisNexis with their sentiment annotations from the Media Tenor dataset. We then sort the final DataFrame `data_match_lexisnexis` in chronological order and retain only the relevant columns. Through this process, we successfully matched **164** Capital articles from LexisNexis with their sentiment annotations.
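
The duplicate check mentioned above does not appear in the cell that follows. A minimal sketch of how such a check could be written is given below, using toy data. The helper name `assert_unique_keys` and all values are hypothetical. The `validate='one_to_one'` argument of `pd.merge` is an additional safeguard that makes pandas raise a `MergeError` if either side contains duplicate merge keys.

```python
import pandas as pd

def assert_unique_keys(df: pd.DataFrame, keys: list[str]) -> None:
    """Raise ValueError if any combination of `keys` occurs more than once."""
    dupes = df[df.duplicated(subset=keys, keep=False)]
    if not dupes.empty:
        raise ValueError(f"{len(dupes)} rows share a key:\n{dupes[keys]}")

# Toy stand-ins for sentiment_data and capital_lexisnexis (placeholder values)
annotations = pd.DataFrame({'title_clean': ['title a', 'title b'],
                            'date': ['15.12.2011', '24.10.2013'],
                            'sentiment': [0.0, 0.0]})
articles = pd.DataFrame({'title_clean': ['title a', 'title c'],
                         'date': ['15.12.2011', '19.07.2012'],
                         'text': ['text a ...', 'text c ...']})

# Passes silently because every (title_clean, date) pair is unique
assert_unique_keys(articles, ['title_clean', 'date'])

# Inner merge that also verifies key uniqueness on both sides
matched = pd.merge(annotations, articles, how='inner',
                   on=['title_clean', 'date'], validate='one_to_one')
print(len(matched))  # 1
```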

In [12]:
# Create dictionary to transform month name into month number
# LexisNexis uses both German and English month names; April, August,
# September, and November are spelled identically in both languages
name_to_number = {
    'Januar': '01', 'Februar': '02', 'März': '03', 'April': '04', 'Mai': '05',
    'Juni': '06', 'Juli': '07', 'August': '08', 'September': '09', 'Oktober': '10',
    'November': '11', 'Dezember': '12',
    'January': '01', 'February': '02', 'March': '03',
    'May': '05', 'June': '06', 'July': '07',
    'October': '10', 'December': '12'
}

# Transform month names into month numbers
capital_lexisnexis['month_num'] = capital_lexisnexis['month'].map(name_to_number)

# Transform single-digit day numbers into two-digit format
capital_lexisnexis['day'] = capital_lexisnexis['day'].map(lambda d: day_transform.get(d, d))

# Combine day, month, and year into a date string
capital_lexisnexis['date'] = capital_lexisnexis.apply(lambda row: f"{row['day']}.{row['month_num']}.{row['year']}", axis=1)

# Initialize the Normalize class with the titles from the 'capital_lexisnexis' DataFrame
normalizer = Normalize(capital_lexisnexis.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the 'capital_lexisnexis' DataFrame as a new column 'title_clean'
capital_lexisnexis['title_clean'] = normalized_titles

# Merge with sentiment_data on title_clean and date
data_match_lexisnexis = pd.merge(sentiment_data, capital_lexisnexis, how='inner', on=['title_clean', 'date'])

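# Rename 'month_num' to 'month'; the merge has already suffixed the overlapping
# 'month' columns as 'month_x' (Media Tenor) and 'month_y' (LexisNexis month name)
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'month_num': 'month'})

# Rename 'year_y' to 'year' if it exists; 'year' is present only in the LexisNexis data,
# so the merge leaves it unsuffixed and this rename has no effect here
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'year_y': 'year'})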

# Convert year, month, and day to integers
data_match_lexisnexis['year'] = data_match_lexisnexis['year'].astype(int)
data_match_lexisnexis['month'] = data_match_lexisnexis['month'].astype(int)
data_match_lexisnexis['day'] = data_match_lexisnexis['day'].astype(int)

# Sort the data in chronological order
data_match_lexisnexis = data_match_lexisnexis.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match_lexisnexis = data_match_lexisnexis.reset_index(drop=True)

# Rename the 'title_y' column to 'title' to reflect the title from the LexisNexis dataset
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'title_y': 'title'})

# Select only the required columns
data_match_lexisnexis = data_match_lexisnexis[['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']]

# Print the number of articles from LexisNexis
num_lexisnexis_articles = len(data_match_lexisnexis)

print(f"Number of articles from LexisNexis: {num_lexisnexis_articles}")

# Display the last few rows of the final matched dataset
data_match_lexisnexis.tail()

Number of articles from LexisNexis: 164


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
159,Capital,24,7,2014,"""Der Mythos vom Staat, der nur stört""","ESSAY. Wer etwas riskiert, erntet die Früchte,...",1.0,DER MYTHOS VOM STAAT_ DER NUR ST_RT.txt,der mythos vom staat der nur stört
160,Capital,24,7,2014,"""Die mächtigste Zahl der wirtschaft""",Das Bruttoinlandsprodukt ist der Puls unserer ...,1.0,DIE M_CHTIGSTE ZAHL DER WIRTSCHAFT.txt,die mächtigste zahl der wirtschaft
161,Capital,24,7,2014,"""Die verdrängte Schuldenkrise""",WELT DER WIRTSCHAFT. Seit Jahren reden alle üb...,-1.0,DIE VERDR_NGTE SCHULDENKRISE.txt,die verdrängte schuldenkrise
162,Capital,21,8,2014,"""Hier schlägt das herz der französischen Wirts...",FRANKREICHS WIRTSCHAFT LAHMT. DOCH EIN PROVINZ...,-1.0,Hier schl_gt das Herz der franz_sischen Wirtsc...,hier schlägt das herz der französischen wirtsc...
163,Capital,21,8,2014,"""Das Desaster der Master""",Banken im Umbruch: GeschäftsmodellDie Finanzwe...,0.0,DAS DESASTER DER MASTER.txt,das desaster der master


## Combine All Articles

As the final step, we combine all Capital articles, specifically those downloaded from Factiva and LexisNexis, into a single DataFrame called `capital_all`. This combined DataFrame is then saved as a CSV file named `capital.csv`.
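
Before saving, it can be useful to confirm that the two sources do not overlap and that the CSV file can be read back. The sketch below shows one possible set of checks on toy data. The DataFrames are placeholders, and an in-memory buffer stands in for `capital.csv`. Note that it writes the file with `index=False`, which is an assumption that differs from the cell below, where the index is written as an additional column.

```python
import io
import pandas as pd

# Toy stand-ins for data_match_factiva and data_match_lexisnexis
factiva = pd.DataFrame({'title': ['A', 'B'], 'day': [18, 23],
                        'month': [9, 10], 'year': [2014, 2014]})
lexisnexis = pd.DataFrame({'title': ['C'], 'day': [24],
                           'month': [10], 'year': [2013]})

combined = pd.concat([factiva, lexisnexis], ignore_index=True)

# The combined row count should equal the sum of the two sources
assert len(combined) == len(factiva) + len(lexisnexis)

# Articles present in both sources would show up as duplicated (title, date) keys
overlap = combined.duplicated(subset=['title', 'day', 'month', 'year']).sum()

# Round-trip through CSV with the same separator to confirm the file can be read back
buffer = io.StringIO()
combined.to_csv(buffer, sep=';', index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=';')
print(overlap, restored.shape)
```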

In [13]:
# Combine all articles from Factiva and LexisNexis datasets into a single DataFrame
capital_all = pd.concat([data_match_factiva, data_match_lexisnexis], sort=False)

# Reset the index of the combined DataFrame
capital_all = capital_all.reset_index(drop=True)

# Print the total number of articles in the combined DataFrame
total_articles = len(capital_all)
print(f"Total number of articles: {total_articles}")

# Display the first few rows of the combined DataFrame to verify the concatenation
capital_all.head()

Total number of articles: 362


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,Capital,18,9,2014,LASST UNS DIE ZUKUNFT BAUEN !,Konjunktur Die Deutschen genießen ihren Aufsch...,-1.0,Factiva-20200811-1348 (2).txt,lasst uns die zukunft bauen
1,Capital,18,9,2014,HINTERTÜR IM FERNEN OSTEN,Unsere Sanktionen gegen Russland werden wirkun...,1.0,Factiva-20200811-1349.txt,hintertür im fernen osten
2,Capital,18,9,2014,"Gute Sparer , schlechte Anleger","Wenn es um Vermögensbildung geht, sind die Deu...",-1.0,Factiva-20200811-1351.txt,gute sparer schlechte anleger
3,Capital,23,10,2014,""" Im Osten muss sich das Unternehmertum erst w...",Deutsche Einheit Kann man die Lehren aus der d...,-1.0,Factiva-20200811-1344 (2).txt,im osten muss sich das unternehmertum erst wie...
4,Capital,23,10,2014,WIRTSCHAFTSKARTE,KONJUNKTUR RUND UM DEN GLOBUS Nach einem guten...,-1.0,Factiva-20200811-1345.txt,wirtschaftskarte


In [14]:
# Sort the combined DataFrame in chronological order
capital_all = capital_all.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
capital_all = capital_all.reset_index(drop=True)

# Drop the 'title_clean' column as it is no longer needed
capital_all = capital_all.drop(columns=['title_clean'])

# Save the combined DataFrame to a CSV file
capital_all.to_csv('capital.csv', encoding='utf-8-sig', sep=';')