# Exploring the data

- What information is in the data?
- How much information do we have?
- Which languages are the documents written in?
- Is it complete? Are there missing values?
- Which information is valuable? Which can be neglected?


## Reading in data

We use a [pandas](https://pandas.pydata.org/) DataFrame to read and explore the CSV file.

In [None]:
import pandas as pd

In [None]:
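# Path to the CSV file with the full texts
path = './data/parliamentary-questions_fulltext_2023.csv'
# index_col=1 uses the second column (position 1) as the row index
data = pd.read_csv(path, index_col=1)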

In [None]:
data

In [None]:
# Number of rows in the dataset
len(data)

In [None]:
# Names of the columns
data.columns

In [None]:
# Exploring unique values of a column
data.document_language.unique()

In [None]:
data.document_type.unique()
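
`unique()` lists the distinct values but not how often each one occurs. A minimal sketch of counting them with `value_counts()` follows. The `toy` frame and its values are invented stand-ins for `data`, used only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "document_language": ["de", "de", "fr", None],
    "document_type": ["type_a", "type_a", "type_b", "type_a"],
})

# Count how often each value occurs.
# dropna=False keeps missing values visible as their own row.
language_counts = toy["document_language"].value_counts(dropna=False)
type_counts = toy["document_type"].value_counts()

print(language_counts)
print(type_counts)
```

On the real data the same calls would be `data.document_language.value_counts(dropna=False)` and `data.document_type.value_counts()`.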

In [None]:
# Looking at the document titles
data.document_title.tolist()

In [None]:
# Explore the texts
sample_question = data.question_text.values[0]
print(sample_question)

In [None]:
sample_answer = data.answer_text.values[10]
print(sample_answer)
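
Printing single texts gives a first impression. To get a rough idea of how much text there is overall, one option is to look at text lengths. The sketch below uses an invented `toy` frame in place of `data`; only the column name `question_text` comes from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the texts are invented for illustration.
toy = pd.DataFrame({
    "question_text": [
        "Short question?",
        "A somewhat longer question text?",
        None,  # a missing text
    ],
})

# Number of characters per text; missing texts stay missing (NaN).
char_lengths = toy["question_text"].str.len()

# Number of whitespace-separated words per text.
word_counts = toy["question_text"].str.split().str.len()

# Summary statistics (count, mean, min, max, ...) over the non-missing texts.
print(char_lengths.describe())
print(word_counts.describe())
```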

In [None]:
# Looking for missing values
data.info()
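
`info()` reports the non-null count per column. For a more direct view of missing values, `isna()` can be combined with `sum()` or `mean()`. The sketch below assumes an invented `toy` frame instead of the real `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "question_text": ["Q1", "Q2", None],
    "answer_text": ["A1", None, None],
})

# Absolute number of missing values per column.
missing_per_column = toy.isna().sum()

# Share of missing values per column (between 0 and 1).
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```

On the real data, `data.isna().sum()` would give the same overview for every column.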

In [None]:
# Reducing the data to the most valuable information
cols = ['document_identifier', 'document_title', 'document_type',
        'document_date', 'question_text', 'answer_text']
df = data[cols]
df

In [None]:
# Storing the data for processing
df.to_csv('./data/parliamentary-questions_2023_sample.csv')
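
By default `to_csv` also writes the row index as the first column, so a later read needs `index_col=0` to restore it. A minimal round-trip sketch follows. The `toy` frame, the index name `hypothetical_id`, and the temporary file name are all assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for `df`; the index name and values are invented.
toy = pd.DataFrame(
    {"document_title": ["Title A", "Title B"], "question_text": ["Q1", "Q2"]},
    index=pd.Index(["id-1", "id-2"], name="hypothetical_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_sample.csv")
    # The index is written as the first column of the CSV file.
    toy.to_csv(out_path)
    # index_col=0 turns that first column back into the index.
    restored = pd.read_csv(out_path, index_col=0)

same_shape = restored.shape == toy.shape
same_columns = list(restored.columns) == list(toy.columns)
print(same_shape, same_columns)
```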