# Basic Data Wrangling with Pandas - Solutions
Pandas is an open-source Python library for data analysis and management. It provides data structures for table-like data as well as tools for data manipulation (e.g. filtering, merging, or sorting).

In this notebook, you will learn some of the basics of Pandas. **This notebook contains the solutions for the Basic Data Wrangling with Pandas notebook.**

A prerequisite for using this notebook is basic knowledge of Python programming.

#### References
The Pandas documentation (https://pandas.pydata.org/docs/) provides rich information should you get stuck.

The datasets are based on KINNEWS and KIRNEWS: https://paperswithcode.com/dataset/kinnews-and-kirnews.

## Imports

In [2]:
# Execute this cell to import Pandas
import pandas as pd

## Pandas DataFrames and Series

In [3]:
# Create a Series (pd.Series) from "mylist" (fill the list with whatever values you like)
mylist = [1, 2, 3, 4, 5]

myseries = pd.Series(mylist)

myseries

0    1
1    2
2    3
3    4
4    5
dtype: int64
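
A Series pairs each value with an index label; by default the labels are 0, 1, 2, ... as in the output above. The sketch below is only an illustration of a Series with custom labels. The variable name `populations` and the choice of labels are assumptions made for this example (the two values are taken from the cities exercise).

```python
import pandas as pd

# Illustrative Series with city names as index labels instead of 0, 1, 2, ...
populations = pd.Series([1132686, 89600], index=["Kigali", "Huye"], name="population")

print(populations["Huye"])        # look up a value by its label
print(populations.index.tolist())
```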

In [4]:
# Create a DataFrame (pd.DataFrame) from the "cities" dictionary
cities = {
    'name': ['Kigali', 'Gisenyi', 'Huye', 'Gitarama', 'Musanze', 'Byumba', 'Cyangugu', 'Kibuye', 'Rwamagana', 'Kibungo'],
    'population': [1132686, 136830, 89600, 87613, 86685, 70593, 63883, 48024, 47203, 46240]
}

cities_dataframe = pd.DataFrame(cities)

cities_dataframe

Unnamed: 0,name,population
0,Kigali,1132686
1,Gisenyi,136830
2,Huye,89600
3,Gitarama,87613
4,Musanze,86685
5,Byumba,70593
6,Cyangugu,63883
7,Kibuye,48024
8,Rwamagana,47203
9,Kibungo,46240
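
The leftmost column in the output above is the row index, not a column of the table. As a small sketch (using a hypothetical two-row table `df` built from two of the cities), the following shows that each dictionary key becomes a column name and that a single column is itself a Series.

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration
df = pd.DataFrame({"name": ["Kigali", "Huye"], "population": [1132686, 89600]})

column = df["population"]          # a single column is a pd.Series
print(type(column).__name__)       # Series
print(list(df.columns))            # ['name', 'population']
print(df.index.tolist())           # [0, 1]
```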


## Importing data

In [5]:
dataset_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_0_500.csv"

# Read the CSV file from the URL and store it in the variable "data"
data = pd.read_csv(dataset_url)

data

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
...,...,...,...
495,5,film ubuzima ni gatebe gato ifite inyigisho zu...,tina avuga yizeye filme abakunzi sinema nyarwa...
496,1,abanyarwanda batuye canada bibutse abazize jen...,igikorwa kwibuka kitabiriwe kandi n’abahagarar...
497,2,ku nshuro mbere u rwanda rubonye umusifuzi uru...,bwa mbere mateka y’u rwanda hagiye kuboneka um...
498,4,abaturage bakwiye kurushaho kubungabunga ubuzi...,mu bihe by’imvura aribyo tunarimo rwanda usang...
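
`pd.read_csv` accepts file-like objects as well as URLs and file paths, which is handy for trying things out without a network connection. The sketch below reads a tiny in-memory CSV; the text in `csv_text` is made up for illustration and only mimics the column layout of the news dataset.

```python
import io

import pandas as pd

# Made-up CSV text with the same three columns as the news dataset
csv_text = "label,title,content\n2,first title,first text\n11,second title,second text\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # (2, 3)
print(list(df.columns))   # ['label', 'title', 'content']
```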


## Understanding your data

In [6]:
# Show the shape of the data
data.shape

(500, 3)
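
`shape` is a plain tuple of (rows, columns), so it can be unpacked directly. A minimal sketch on a hypothetical three-row stand-in for the news table:

```python
import pandas as pd

# Hypothetical stand-in for the news table
df = pd.DataFrame({
    "label": [2, 11, 4],
    "title": ["a", "b", "c"],
    "content": ["x", "y", "z"],
})

n_rows, n_cols = df.shape   # shape is a tuple: (rows, columns)
print(n_rows, n_cols)       # 3 3
print(len(df))              # len() gives the number of rows only
```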

In [7]:
# Show the first 17 rows of your data
data.head(17)

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
5,1,saa niyo saha nziza gutera akabariro bashakanye,yifashishije science’ umwanditsi akaba n’inzob...
6,7,ibyo wamenya bifaru by’abarusiya bitabonwa rad...,igisirikare cy’igihugu cy’uburusiya cyamaze ku...
7,2,arsenal yirukanye unai emery n’abari bamwungirije,kuri gatanu nibwo ikipe arsenal yatangaje yama...
8,5,wizkid yatumiwe n’ umuryango perezida awukorer...,wizkid umwe bahanzi bakunzwe cyane nigeria bat...
9,2,basketball ikipe y’u rwanda u y’abahungu yatsi...,nubwo kipe yatsinzwe bigaragara yitwaye neza r...


## Filtering your data

In [10]:
# Select only the "title" column
data["title"]

0      ikipe y’ u rwanda amavubi yahesheje u rwanda a...
1      urubyiruko itorero erc giterane cy’ububyutse k...
2      rusizi bambaye udupfukamunwa n’ubwo bamwe bata...
3      abanyarwanda batatu begukanye ibihembo pam awards
4      light family choir igiye gukora igitaramo cy’a...
                             ...                        
495    film ubuzima ni gatebe gato ifite inyigisho zu...
496    abanyarwanda batuye canada bibutse abazize jen...
497    ku nshuro mbere u rwanda rubonye umusifuzi uru...
498    abaturage bakwiye kurushaho kubungabunga ubuzi...
499             ndererehe yafatiwe cyuho atetse kanyanga
Name: title, Length: 500, dtype: object

In [12]:
# Select the "title" and "label" columns, in that order
data[['title', 'label']]

Unnamed: 0,title,label
0,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,2
1,urubyiruko itorero erc giterane cy’ububyutse k...,11
2,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,4
3,abanyarwanda batatu begukanye ibihembo pam awards,5
4,light family choir igiye gukora igitaramo cy’a...,11
...,...,...
495,film ubuzima ni gatebe gato ifite inyigisho zu...,5
496,abanyarwanda batuye canada bibutse abazize jen...,1
497,ku nshuro mbere u rwanda rubonye umusifuzi uru...,2
498,abaturage bakwiye kurushaho kubungabunga ubuzi...,4
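
The two selections above return different types: a single column name in brackets yields a Series, while a list of column names yields a DataFrame, even if the list has only one entry. A short sketch on a hypothetical table `df`:

```python
import pandas as pd

# Hypothetical table, used only for illustration
df = pd.DataFrame({"label": [2, 11], "title": ["first", "second"], "content": ["x", "y"]})

as_series = df["title"]     # single name  -> pd.Series
as_frame = df[["title"]]    # list of names -> pd.DataFrame with one column

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame.shape)
```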


In [11]:
# Select rows 100 to 119 in the "content" column (the end of the slice, 120, is excluded)
data["content"][100:120]

100    ku werurwe itsinda ry’abashakashatsi kaminuza ...
101    perezida kagame yavuze impamvu yatumye abakuru...
102    abayobozi repubulika demukarasi kongo batangaj...
103    ingabo z’u rwanda ziri basirikare bahuriye mut...
104    irushanwa as kigali pre-season tournament riso...
105    umuyobozi w’umudugudu kageri kagari bukinanyan...
106    ubwo karere burera batangizaga icyumweru kwibu...
107    aba basore bakora akazi kubika gushyingura nez...
108    itsinda ryo kaminuza yitwa monash swinburne rm...
109    itangazo ryashyizwe ahagaragara minisiteri y’u...
110    abakinnyi batanu batoranyijwe harimo yaya tour...
111    ubwo bushakashatsi bwagaragaje inyoni bihugu z...
112    bahati prince mwene nkundakozera michel kazege...
113    igiciro lisansi cyari gisanzwe kiri mafaranga ...
114    kimwe n’indi myaka yashize mwaka - wakozwemo i...
115    mu biganiro yagiranye n’abanyamuhanga tariki n...
116    havugimana jonas utuye mudugudu kabahona kagar...
117    indirimbo bafatanije mpe
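
Slicing rules differ depending on the accessor, which is easy to trip over. As a sketch on a hypothetical Series `s` with the default index: `.iloc` slices by position and excludes the end, whereas `.loc` slices by label and includes it.

```python
import pandas as pd

# Hypothetical Series with the default index 0..199
s = pd.Series(range(200))

by_position = s.iloc[100:120]   # positions 100..119, end excluded
by_label = s.loc[100:120]       # labels 100..120, end included

print(len(by_position), len(by_label))   # 20 21
```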

## Combining data from multiple sources

In [13]:
# Read the other dataset
dataset2_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_1000_1500.csv"

data2 = pd.read_csv(dataset2_url)

data2

Unnamed: 0,label,title,content
0,1,ari maboko polisi kubera gukoresha ibiyobyabwenge,umunyamabanga nshingwabikorwa w’akagari manwar...
1,1,abanyarwanda batuye canada bibutse abazize jen...,igikorwa kwibuka kitabiriwe kandi n’abahagarar...
2,5,umuhanzi fulgence nzira kugaruka ruhando rw’um...,mu kiganiro cyihariye twagiranye yadutangarije...
3,1,nsanzimana etienne yishwe anizwe anatewe ibyuma,nkuko bitangazwa n’inzego police karere rutsir...
4,2,sekamana maxime yahetse apr ayihesha amanota a...,byasabye ikipe apr fc gutegereza sekamana maxi...
...,...,...,...
495,1,ihohoterwa ntirikorerwa abagore gusa n’abagabo...,ibi bikaba biterwa hari abagore bamwe barumvis...
496,11,muco adonis wahimbye nzogera yavuze’ yashyize ...,ndategereje’ indirimbo muco adonis yakoze ugus...
497,1,abatangabuhamya rubanza munyenyezi bemerewe ku...,steven mcauliffe yatangajeko umwirondoro w’aba...
498,2,mayweather agiye guhura mcgregor mukino bise b...,ni umukino bamwe bise byendagusetsa ariko usho...


In [14]:
# Make (concatenate) one big DataFrame out of both datasets
all_data = pd.concat([data, data2])

all_data

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
...,...,...,...
495,1,ihohoterwa ntirikorerwa abagore gusa n’abagabo...,ibi bikaba biterwa hari abagore bamwe barumvis...
496,11,muco adonis wahimbye nzogera yavuze’ yashyize ...,ndategereje’ indirimbo muco adonis yakoze ugus...
497,1,abatangabuhamya rubanza munyenyezi bemerewe ku...,steven mcauliffe yatangajeko umwirondoro w’aba...
498,2,mayweather agiye guhura mcgregor mukino bise b...,ni umukino bamwe bise byendagusetsa ariko usho...
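
Note that the concatenated table keeps the original row labels of both inputs, so index values repeat. If unique row labels are needed, one option is `ignore_index=True`. The sketch below uses two hypothetical two-row tables to show the difference.

```python
import pandas as pd

# Two hypothetical tables, each with the default index 0, 1
first = pd.DataFrame({"title": ["a", "b"]})
second = pd.DataFrame({"title": ["c", "d"]})

stacked = pd.concat([first, second])                         # keeps original labels
renumbered = pd.concat([first, second], ignore_index=True)   # new labels 0..n-1

print(stacked.index.tolist())      # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```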


In [17]:
# Merge two tables using a common identifier

# First, retrieve the CSV file containing the text labels (categories such as sports, tourism, ...)
text_labels_file_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/labels.csv"
text_labels = pd.read_csv(text_labels_file_url)

# Now, merge
full_data_with_labels = pd.merge(all_data, text_labels, left_on='label', right_on='number')

full_data_with_labels

Unnamed: 0,label_x,title,content,number,label_y
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...,2,sport
1,2,arsenal yirukanye unai emery n’abari bamwungirije,kuri gatanu nibwo ikipe arsenal yatangaje yama...,2,sport
2,2,basketball ikipe y’u rwanda u y’abahungu yatsi...,nubwo kipe yatsinzwe bigaragara yitwaye neza r...,2,sport
3,2,- intashyo ikipe y’igihugu amavubi,-ese habuze kugeza magingo imyaka irihiritse a...,2,sport
4,2,ferwacy igiye gushyira amagare mashuri abanza ...,nk’uko emmanuel murenzi umunyamabanga uhoraho ...,2,sport
...,...,...,...,...,...
995,8,kigali serena hotel nyungwe forest lodge hotel...,mu gikorwa gutangaza uru rutonde cyabereye ser...,8,tourism
996,8,ba mukerarugendo bazashobora kwishyura bashaka...,ubu buryo buzanywe kompanyi yitwa expedia isan...,8,tourism
997,8,rdb yahaye nyabihu tourism ibikoresho bikoresh...,iyi koperative kandi yahawe ibitabo bikubiyemo...,8,tourism
998,8,intambara kongo ntabwo yigeze ihungabanya umut...,iyi kompanyi iratangaza gihe hari amakuru atan...,8,tourism
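
Because both tables contain a column named `label`, the merge result carries the suffixed columns `label_x` and `label_y`. One possible way to get cleaner names is to rename the text column before merging. The following is a sketch on small hypothetical tables; the name `label_name` and the use of `how="left"` (which keeps every article in its original order) are choices made for this example, not part of the exercise.

```python
import pandas as pd

# Hypothetical miniature versions of the article table and the label lookup table
articles = pd.DataFrame({"label": [2, 8, 2], "title": ["a", "b", "c"]})
names = pd.DataFrame({"number": [2, 8], "label": ["sport", "tourism"]})

merged = pd.merge(
    articles,
    names.rename(columns={"label": "label_name"}),  # avoid the label_x / label_y clash
    left_on="label",
    right_on="number",
    how="left",
).drop(columns="number")  # "number" duplicates "label" after the merge

print(list(merged.columns))   # ['label', 'title', 'label_name']
```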


## Basic DataFrame operations

In [18]:
# Add a column indicating the language of the text
data['language'] = "Kinyarwanda"

data

Unnamed: 0,label,title,content,language
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...,Kinyarwanda
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...,Kinyarwanda
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...,Kinyarwanda
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...,Kinyarwanda
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...,Kinyarwanda
...,...,...,...,...
495,5,film ubuzima ni gatebe gato ifite inyigisho zu...,tina avuga yizeye filme abakunzi sinema nyarwa...,Kinyarwanda
496,1,abanyarwanda batuye canada bibutse abazize jen...,igikorwa kwibuka kitabiriwe kandi n’abahagarar...,Kinyarwanda
497,2,ku nshuro mbere u rwanda rubonye umusifuzi uru...,bwa mbere mateka y’u rwanda hagiye kuboneka um...,Kinyarwanda
498,4,abaturage bakwiye kurushaho kubungabunga ubuzi...,mu bihe by’imvura aribyo tunarimo rwanda usang...,Kinyarwanda


In [19]:
# Delete the column you have just added
data = data.drop(columns='language')

data

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
...,...,...,...
495,5,film ubuzima ni gatebe gato ifite inyigisho zu...,tina avuga yizeye filme abakunzi sinema nyarwa...
496,1,abanyarwanda batuye canada bibutse abazize jen...,igikorwa kwibuka kitabiriwe kandi n’abahagarar...
497,2,ku nshuro mbere u rwanda rubonye umusifuzi uru...,bwa mbere mateka y’u rwanda hagiye kuboneka um...
498,4,abaturage bakwiye kurushaho kubungabunga ubuzi...,mu bihe by’imvura aribyo tunarimo rwanda usang...
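
`drop` returns a new DataFrame and leaves the original untouched, which is why the solution assigns the result back to `data`. A minimal sketch with a hypothetical one-row table:

```python
import pandas as pd

# Hypothetical one-row table
df = pd.DataFrame({"title": ["a"], "language": ["Kinyarwanda"]})

trimmed = df.drop(columns="language")   # returns a copy without the column

print(list(df.columns))        # ['title', 'language']  (original unchanged)
print(list(trimmed.columns))   # ['title']
```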


In [20]:
# Compute the mean length of the title and content columns (in characters)
title_mean_length = data['title'].str.len().mean()
content_mean_length = data['content'].str.len().mean()

print(f'The average length of the title in the data set is {title_mean_length} characters and of the content {content_mean_length} characters.')

The average length of the title in the data set is 63.406 characters and of the content 1847.848 characters.
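
To see what `.str.len()` does before the mean is taken, here is a small sketch on two titles borrowed from the dataset. It returns one character count per row, and rounding the mean can make the printed message easier to read.

```python
import pandas as pd

# Two short titles taken from the dataset
titles = pd.Series(["ebola mugi goma", "pggss season yatangiye"])

lengths = titles.str.len()        # one character count per row
print(lengths.tolist())           # [15, 22]
print(round(lengths.mean(), 1))   # 18.5
```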


In [22]:
# Select rows where the length of the title is below 25 characters
data[data['title'].str.len() < 25]

Unnamed: 0,label,title,content
134,5,pggss season yatangiye,abayobozi bralirwa n’aba east africa promotors...
186,1,akabaye icwende ntikoga,kuva nkundura y’amashyaka menshi u rwanda rwis...
305,5,pggss season yatangiye,abayobozi bralirwa n’aba east africa promotors...
335,4,ebola mugi goma,minisiteri y’ubuzima repubulika iharanira demu...
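
The filter above works through a boolean mask: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. As a sketch on a hypothetical three-row table, masks can also be combined with `&` (each condition in parentheses).

```python
import pandas as pd

# Hypothetical three-row table built from titles in the dataset
df = pd.DataFrame({
    "label": [4, 5, 5],
    "title": [
        "ebola mugi goma",
        "pggss season yatangiye",
        "abanyarwanda batatu begukanye ibihembo pam awards",
    ],
})

mask = df["title"].str.len() < 25            # boolean Series, one value per row
short = df[mask]                             # rows where the mask is True
short_label_5 = df[mask & (df["label"] == 5)]   # both conditions must hold

print(mask.tolist())                       # [True, True, False]
print(short_label_5["title"].tolist())     # ['pggss season yatangiye']
```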


In [31]:
# Remove duplicates (defined as rows with the same title) from the data and check how many rows remain 
data.drop_duplicates('title')

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
...,...,...,...
491,5,rtd brig gen yemerewe kwiyamamariza umwanya ku...,akanama gashinzwe amatora y’umuyobozi ferwafa ...
492,3,abakozi batsinda leta yabishyuye miliyoni itsi...,raporo komisiyo ishinzwe abakozi leta igaragaz...
493,4,abaturage bakwiye kurushaho kubungabunga ubuzi...,mu bihe by’imvura aribyo tunarimo rwanda usang...
495,5,film ubuzima ni gatebe gato ifite inyigisho zu...,tina avuga yizeye filme abakunzi sinema nyarwa...
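
To check how many rows remain, one option is to compare `len()` before and after, or to count the duplicates directly with `duplicated`. The sketch below uses a hypothetical three-row table in which one title repeats.

```python
import pandas as pd

# Hypothetical table where the title "a" appears twice
df = pd.DataFrame({"title": ["a", "b", "a"], "content": ["x", "y", "z"]})

deduplicated = df.drop_duplicates(subset="title")   # keeps the first occurrence

print(len(df), len(deduplicated))                # 3 2
print(df.duplicated(subset="title").sum())       # 1 row flagged as a duplicate
```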


In [26]:
# Sort the data according to the label
data.sort_values(by='label', ascending=True)

Unnamed: 0,label,title,content
499,1,ndererehe yafatiwe cyuho atetse kanyanga,umunyamabanga nshingwabikorwa w’umurenge mayan...
118,1,ikibazo kubonera ibihugu abagizwe abere ictr g...,aganira n’ibiro by’itangazamakuru hirondelle e...
250,1,abahanzi basuye urwibutso jenoside ntarama n’u...,abahanzi batorewe gukomeza cyikiro kabiri bita...
252,1,abasenateri bazahagararira intara n’umujyi kig...,mu turere twose gihugu hazindukiye amatora y’a...
115,1,muhanga ihungabana rikomoka jenoside si iry’ab...,mu biganiro yagiranye n’abanyamuhanga tariki n...
...,...,...,...
44,14,ibintu bizonga umukobwa rukundo,umukobwa ariwe wese agiye rukundo asa n’utuye ...
398,14,naciye inyuma umugabo wanjye ndyamana n’umuhun...,mbigenze nte umuhungu nikururiye ansenyeye uru...
137,14,ibimenyetso bishobora kwereka mwashakanye agik...,hari igihe umuntu agira umukunzi buzima bagata...
309,14,uko wakwambara imbere y’umukunzi wawe mugihe m...,ni byiza kumenya uburyo witwara imbere yuwo mw...
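
In the output above, rows that share a label do not appear in their original order, since the default sorting algorithm makes no guarantee about ties. If the original order within each label matters, `kind="stable"` is one option. A sketch on a hypothetical four-row table:

```python
import pandas as pd

# Hypothetical table with repeated labels
df = pd.DataFrame({"label": [2, 1, 2, 1], "title": ["a", "b", "c", "d"]})

stable = df.sort_values(by="label", kind="stable")   # ties keep their original order
descending = df.sort_values(by="label", ascending=False)

print(stable.index.tolist())         # [1, 3, 0, 2]
print(descending["label"].tolist())  # [2, 2, 1, 1]
```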


In [30]:
# Sort the data according to the length of the content and assign the result to a new table 'sorted_data'
# (named 'sorted_data' rather than 'sorted' to avoid shadowing Python's built-in sorted() function)
sorted_data = data.sort_values(by='content', key=lambda col: col.str.len())

# Add a column called "content_length" to see the number of characters in the content of each newspaper article (to confirm that the sorting was correct)
sorted_data['content_length'] = sorted_data['content'].str.len()

sorted_data

Unnamed: 0,label,title,content,content_length
174,4,bihozagara agiye gushyingurwa mpera z’iki cyum...,jacques bihozagara yapfiriye gereza mpimba bur...,64
456,5,umuhanzi noliva yatangaje ibintu bitatu umugor...,noliva yatangaje ibintu bitatu umugore wese ah...,76
453,6,gikongoro abatutsi basabwe kwiyirukana kazi,mu cyahoze perefegitura gikongoro hakozwe urut...,168
119,2,la tropicale amissa bongo abanyarwanda bagowe ...,abakinnyi team rwanda gihugu gabon bagowe n’ag...,173
112,5,itangazo ryo gusaba guhindura amazina kwa baha...,bahati prince mwene nkundakozera michel kazege...,204
...,...,...,...,...
325,6,inkotanyi zirimo gutsindwa- amagambo nyandiko ...,nyuma gusoma inyandiko ibanziriza hari umusomy...,8589
455,3,bamwe banyamakuru basanga itegeko ryo kubona a...,bamwe bakora umwuga w’itangazamakuru rwanda nd...,8733
14,2,- intashyo ikipe y’igihugu amavubi,-ese habuze kugeza magingo imyaka irihiritse a...,8909
58,5,umuraperi tupac aza kuba akiriho yujuje imyaka...,kugeza impaka ziracyari zose k’urupfu rwe hari...,11412
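
An alternative to the `key` argument is to create the helper column first and then sort by it. The sketch below shows this on a hypothetical three-row table; `assign` returns a new DataFrame, so the original stays unchanged.

```python
import pandas as pd

# Hypothetical table with contents of different lengths
df = pd.DataFrame({"content": ["ccc", "a", "bb"]})

by_length = (
    df.assign(content_length=df["content"].str.len())   # add the helper column
      .sort_values(by="content_length")                 # then sort by it
)

print(by_length["content"].tolist())          # ['a', 'bb', 'ccc']
print(by_length["content_length"].tolist())   # [1, 2, 3]
```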
