# Data Wrangling Exercises


## Introduction

Data wrangling is the process of cleaning, transforming, and organizing data to make it more suitable for analysis. It is a critical step in any data analysis project, as it ensures that the data is accurate, consistent, and complete.

These exercises provide practice in data wrangling on a real-world dataset: the Slovenian Natural Language Inference dataset (SI-NLI), which contains premise-hypothesis text pairs, each labeled as entailment, contradiction, or neutral.

The exercises cover a range of data wrangling techniques: importing data, computing basic statistics, subsetting observations and variables, creating new variables, grouping data, and combining datasets.

## Get data

1. Download SI-NLI from the [CLARIN.SI repository](https://www.clarin.si/repository/xmlui/handle/11356/1707).
2. Load the libraries.
3. Import the ```train.tsv``` file.

In [23]:
import pandas as pd
import numpy as np

In [24]:
# Read the tab-separated training split into a DataFrame.
auto = pd.read_csv("./train.tsv", sep="\t")
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4392 entries, 0 to 4391
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   pair_id           4392 non-null   object
 1   premise           4392 non-null   object
 2   hypothesis        4392 non-null   object
 3   annotation_1      4301 non-null   object
 4   comment_1         43 non-null     object
 5   annotator1_id     4301 non-null   object
 6   annotation_2      4301 non-null   object
 7   comment_2         118 non-null    object
 8   annotator2_id     4301 non-null   object
 9   annotation_3      411 non-null    object
 10  comment_3         8 non-null      object
 11  annotator3_id     416 non-null    object
 12  annotation_FINAL  3884 non-null   object
 13  label             4392 non-null   object
dtypes: object(14)
memory usage: 480.5+ KB


## Basic statistics

1. How many examples are in the dataframe? 4392
2. How many variables are in the dataframe? 14
3. Count the values in the ```label``` column. 4392 (all non-null)
4. Are there any missing values in the data? Yes
5. Count the number of missing values per column.

In [None]:
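len(auto)        # number of examples (rows)
auto.info()      # column names, non-null counts, and dtypes
auto.describe()  # summary statistics per column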
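
The questions about label counts and missing values per column can be answered along the following lines. This is a minimal sketch on a small made-up frame so that it runs on its own; the toy rows and the name `df` are assumptions, and in the notebook the same calls would be applied to `auto`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data (hypothetical rows, not taken from SI-NLI).
df = pd.DataFrame({
    "premise": ["a", "b", "c", "d"],
    "comment_1": [np.nan, "unclear", np.nan, np.nan],
    "label": ["neutral", "entailment", "neutral", "contradiction"],
})

n_rows, n_cols = df.shape                  # number of examples and variables
label_counts = df["label"].value_counts()  # occurrences of each label
has_missing = df.isna().any().any()        # True if any cell is missing
missing_per_column = df.isna().sum()       # missing values per column

print(n_rows, n_cols)
print(label_counts)
print(has_missing)
print(missing_per_column)
```

`value_counts()` reports how often each label occurs, which is usually more informative than the single non-null count shown by `info()`.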

## Subset observations and variables

1. Select the ```premise``` column and store it in a list.
2. Print the first 3 rows of the first 3 columns.
3. Select the ```pair_id```, ```premise```, ```hypothesis```, and ```label``` columns and save them in the ```train_dataset``` variable.
4. Drop the ```pair_id``` column.
5. Convert all column names to uppercase.
6. Replace ```_``` with ```-``` in column names.
7. Select rows with the ```neutral``` label.
8. Select the last 30 rows.
9. Select rows whose ```hypothesis``` is longer than 100 characters.
10. Select rows whose ```hypothesis``` is longer than 100 characters and whose label is ```neutral```.
11. Select the row with the longest ```hypothesis```.
12. Remove rows that contain ```č```, ```š```, or ```ž``` in ```premise``` or ```hypothesis```.
13. Remove rows that contain at least one missing value.
14. Remove the column with the most missing values.

In [None]:
# Series.tolist() converts the column to a plain Python list.
a = auto["premise"].tolist()
a

In [None]:
auto.iloc[:3, :3]

In [42]:
# Keep only the four columns of interest.
train_dataset = auto.loc[:, ["pair_id", "premise", "hypothesis", "label"]]
# drop() returns a new DataFrame, so the result has to be assigned back.
train_dataset = train_dataset.drop(columns="pair_id")
train_dataset.columns = train_dataset.columns.str.upper()
train_dataset.columns = train_dataset.columns.str.replace("_", "-")
train_dataset

| | PREMISE | HYPOTHESIS | LABEL |
|---|---|---|---|
| 0 | Vendar se je anglikanska večina v grofijah na ... | A na glasovanju o priključitvi ozemlja k Sever... | entailment |
| 1 | INŠTRUKTOR IZ PRTLJAŽNIKA V DRUGO POTOVALKO PR... | Učitelj je vzel iz prtljažnika iz prve potoval... | contradiction |
| 2 | biotska raznovrstnost – v splošnem je to razno... | Četudi je biodiverziteta pomemben del biološke... | contradiction |
| 3 | Preroški pomen: Če v sanjah bedite, je to na s... | V preroškem smislu budnost v sanjah nakazuje o... | entailment |
| 4 | Jeseni so dnevi krajši, stemni se že dokaj zgo... | V krajših jesenskih dneh tema nastopi relativn... | entailment |
| ... | ... | ... | ... |
| 4387 | Kot je povedal, naj bi novi načrt podpisal tak... | Sam je brez zadržkov pojasnil situacijo in raz... | neutral |
| 4388 | Kot je povedal, naj bi novi načrt podpisal tak... | Sam se na vprašanje o podpisu novih načrtov ni... | contradiction |
| 4389 | Večina delničarjev ga je potrdila za predsedni... | Banka Karantanija je po izboru večine delničar... | entailment |
| 4390 | Večina delničarjev ga je potrdila za predsedni... | Trdo delo Janeza Pestotnika se mu je končno ob... | neutral |
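
Exercises 7 to 14 have no solution cell, so here is one possible approach. It is a minimal sketch on a small made-up frame; the toy rows and all variable names (`df`, `is_long`, `no_accents`, and so on) are assumptions, and in the notebook the same expressions would be applied to `auto`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data (hypothetical rows, not taken from SI-NLI).
df = pd.DataFrame({
    "premise": ["Danes je lepo vreme.", "Kupil je nov avto.", "Pes spi.", "Ana bere."],
    "hypothesis": ["Vreme je slabo.", "x" * 120, "Pes počiva.", "y" * 150],
    "comment_1": [np.nan, np.nan, "ok", np.nan],
    "label": ["neutral", "neutral", "entailment", "contradiction"],
})

# 7. Rows with the neutral label.
neutral = df[df["label"] == "neutral"]

# 8. Last 30 rows (returns every row when the frame is shorter).
last_rows = df.tail(30)

# 9-10. Boolean masks can be combined with & (and), | (or), ~ (not).
is_long = df["hypothesis"].str.len() > 100
long_hyp = df[is_long]
long_neutral = df[is_long & (df["label"] == "neutral")]

# 11. idxmax() gives the index label of the longest hypothesis.
longest = df.loc[[df["hypothesis"].str.len().idxmax()]]

# 12. Lowercase č, š, ž only; pass case=False to also match Č, Š, Ž.
pattern = "[čšž]"
has_accent = (
    df["premise"].str.contains(pattern, na=False)
    | df["hypothesis"].str.contains(pattern, na=False)
)
no_accents = df[~has_accent]

# 13. Keep only rows without any missing value.
complete_rows = df.dropna()

# 14. Drop the column with the most missing values (first one on ties).
worst_column = df.isna().sum().idxmax()
without_worst = df.drop(columns=worst_column)

print(worst_column)
```

None of these calls modify `df` in place; each returns a new object, which is why every result is assigned to its own name.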


## Create new variables

1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.
2. Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.
3. Create a boolean variable ```agreement``` that indicates whether all annotators agreed on the label.
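
One way to approach these three exercises is sketched below on a small made-up frame. The toy rows, the column name `annotation_count`, and the choice of `a`, `e`, `i`, `o`, `u` as the vowel set are assumptions. On the real data, rows without any annotation would get a count of `0`, so they may need to be handled separately to stay within the values `1` to `3`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data (hypothetical rows, not taken from SI-NLI).
df = pd.DataFrame({
    "premise": ["Danes je lepo vreme.", "Pes spi."],
    "hypothesis": ["Vreme je slabo.", "Pes počiva."],
    "annotation_1": ["contradiction", "entailment"],
    "annotation_2": ["contradiction", "neutral"],
    "annotation_3": [np.nan, "entailment"],
})

# 1. Count vowels; lowercasing first makes the count case-insensitive.
VOWELS = "[aeiou]"  # assumption: the five vowel letters
for col in ["premise", "hypothesis"]:
    df[f"vowel_count_{col}"] = df[col].str.lower().str.count(VOWELS).astype(int)

# 2. Number of non-missing annotations per row.
annotation_cols = ["annotation_1", "annotation_2", "annotation_3"]
df["annotation_count"] = df[annotation_cols].notna().sum(axis=1)

# 3. Annotators agree when the row has exactly one distinct non-missing label.
df["agreement"] = df[annotation_cols].nunique(axis=1) == 1

print(df[["vowel_count_premise", "vowel_count_hypothesis",
          "annotation_count", "agreement"]])
```

`nunique(axis=1)` ignores missing values by default, so an example with only two annotations still counts as agreement when both labels match.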

## Save dataframes

1. Save the original dataset to disk in ```csv``` format.
...
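
Saving to ```csv``` could look roughly like this. It is a minimal sketch: the toy frame, the file name `train.csv`, and the temporary output directory are assumptions chosen so that the example runs on its own. In the notebook, `auto.to_csv(...)` with a path of your choice would take their place.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the loaded data (hypothetical row, not taken from SI-NLI).
df = pd.DataFrame({
    "premise": ["Pes spi."],
    "hypothesis": ["Pes počiva."],
    "label": ["entailment"],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = Path(tempfile.mkdtemp()) / "train.csv"

# index=False leaves out the row index, so reloading the file
# does not produce an extra "Unnamed: 0" column.
df.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to check that the round trip preserved the data.
reloaded = pd.read_csv(out_path)
print(reloaded)
```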