# World Cup Matches Notebook
This notebook will guide you through:
- Loading and inspecting the dataset with Pandas
- Selecting specific rows and columns
- Filtering data with boolean masks
- Adding new columns and modifying existing ones

## Objectives
1. Access information about a dataset with pandas methods.
2. Select rows and columns using `.loc` and `.iloc`.
3. Use boolean indexing to filter data.
4. Add and modify columns in a DataFrame.

In [1]:
import pandas as pd

# Load the dataset
world_cup_df = pd.read_csv('WorldCupMatches.csv')

# Display the first few rows
world_cup_df.head(10)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
0,1930,13 Jul 1930 - 15:00,Group 1,Pocitos,Montevideo,France,4,1,Mexico,,4444.0,3,0,LOMBARDI Domingo (URU),CRISTOPHE Henry (BEL),REGO Gilberto (BRA),201,1096,FRA,MEX
1,1930,13 Jul 1930 - 15:00,Group 4,Parque Central,Montevideo,USA,3,0,Belgium,,18346.0,2,0,MACIAS Jose (ARG),MATEUCCI Francisco (URU),WARNKEN Alberto (CHI),201,1090,USA,BEL
2,1930,14 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,2,1,Brazil,,24059.0,2,0,TEJADA Anibal (URU),VALLARINO Ricardo (URU),BALWAY Thomas (FRA),201,1093,YUG,BRA
3,1930,14 Jul 1930 - 14:50,Group 3,Pocitos,Montevideo,Romania,3,1,Peru,,2549.0,1,0,WARNKEN Alberto (CHI),LANGENUS Jean (BEL),MATEUCCI Francisco (URU),201,1098,ROU,PER
4,1930,15 Jul 1930 - 16:00,Group 1,Parque Central,Montevideo,Argentina,1,0,France,,23409.0,0,0,REGO Gilberto (BRA),SAUCEDO Ulises (BOL),RADULESCU Constantin (ROU),201,1085,ARG,FRA
5,1930,16 Jul 1930 - 14:45,Group 1,Parque Central,Montevideo,Chile,3,0,Mexico,,9249.0,1,0,CRISTOPHE Henry (BEL),APHESTEGUY Martin (URU),LANGENUS Jean (BEL),201,1095,CHI,MEX
6,1930,17 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,4,0,Bolivia,,18306.0,0,0,MATEUCCI Francisco (URU),LOMBARDI Domingo (URU),WARNKEN Alberto (CHI),201,1092,YUG,BOL
7,1930,17 Jul 1930 - 14:45,Group 4,Parque Central,Montevideo,USA,3,0,Paraguay,,18306.0,2,0,MACIAS Jose (ARG),APHESTEGUY Martin (URU),TEJADA Anibal (URU),201,1097,USA,PAR
8,1930,18 Jul 1930 - 14:30,Group 3,Estadio Centenario,Montevideo,Uruguay,1,0,Peru,,57735.0,0,0,LANGENUS Jean (BEL),BALWAY Thomas (FRA),CRISTOPHE Henry (BEL),201,1099,URU,PER
9,1930,19 Jul 1930 - 12:50,Group 1,Estadio Centenario,Montevideo,Chile,1,0,France,,2000.0,0,0,TEJADA Anibal (URU),LOMBARDI Domingo (URU),REGO Gilberto (BRA),201,1094,CHI,FRA


## Basic Data Inspection
We'll look at the last rows, data info, shape, and column names.

In [11]:
# Display last rows
world_cup_df.tail(3)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
849,2014,09 Jul 2014 - 17:00,Semi-finals,Arena de Sao Paulo,Sao Paulo,Netherlands,0,0,Argentina,Argentina win on penalties (2 - 4),63267.0,0,0,Cüneyt ÇAKIR (TUR),DURAN Bahattin (TUR),ONGUN Tarik (TUR),255955,300186490,NED,ARG
850,2014,12 Jul 2014 - 17:00,Play-off for third place,Estadio Nacional,Brasilia,Brazil,0,3,Netherlands,,68034.0,0,2,HAIMOUDI Djamel (ALG),ACHIK Redouane (MAR),ETCHIALI Abdelhak (ALG),255957,300186502,BRA,NED
851,2014,13 Jul 2014 - 16:00,Final,Estadio do Maracana,Rio De Janeiro,Germany,1,0,Argentina,Germany win after extra time,74738.0,0,0,Nicola RIZZOLI (ITA),Renato FAVERANI (ITA),Andrea STEFANI (ITA),255959,300186501,GER,ARG


In [8]:
# Get a concise summary of the data
world_cup_df.info()

#df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 852 entries, 0 to 851
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Year                  852 non-null    int64  
 1   Datetime              852 non-null    object 
 2   Stage                 852 non-null    object 
 3   Stadium               852 non-null    object 
 4   City                  852 non-null    object 
 5   Home Team Name        852 non-null    object 
 6   Home Team Goals       852 non-null    int64  
 7   Away Team Goals       852 non-null    int64  
 8   Away Team Name        852 non-null    object 
 9   Win conditions        852 non-null    object 
 10  Attendance            850 non-null    float64
 11  Half-time Home Goals  852 non-null    int64  
 12  Half-time Away Goals  852 non-null    int64  
 13  Referee               852 non-null    object 
 14  Assistant 1           852 non-null    object 
 15  Assistant 2           8

In [18]:
# Statistical summary of every column; include="all" also covers the non-numeric ones
world_cup_df.describe(include="all")
# NaN ("Not a Number") marks a statistic that does not apply to that column's type

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
count,852.0,852,852,852,852,852,852.0,852.0,852,852.0,850.0,852.0,852.0,852,852,852,852.0,852.0,852,852
unique,,602,23,181,151,78,,,83,43.0,,,,366,387,408,,,77,82
top,,27 May 1934 - 16:30,Round of 16,Estadio Azteca,Mexico City,Brazil,,,Mexico,,,,,Ravshan IRMATOV (UZB),BERANEK Alois (AUT),KOCHKAROV Bakhadyr (KGZ),,,BRA,MEX
freq,,8,72,19,23,82,,,38,787.0,,,,10,7,10,,,82,38
mean,1985.089202,,,,,,1.811033,1.0223,,,45164.8,0.70892,0.428404,,,,10661770.0,61346870.0,,
std,22.448825,,,,,,1.610255,1.087573,,,23485.249247,0.937414,0.691252,,,,27296130.0,111057200.0,,
min,1930.0,,,,,,0.0,0.0,,,2000.0,0.0,0.0,,,,201.0,25.0,,
25%,1970.0,,,,,,1.0,0.0,,,30000.0,0.0,0.0,,,,262.0,1188.75,,
50%,1990.0,,,,,,2.0,1.0,,,41579.5,0.0,0.0,,,,337.0,2191.0,,
75%,2002.0,,,,,,3.0,2.0,,,61374.5,1.0,1.0,,,,249722.0,43950060.0,,


In [14]:
# Get the shape (rows, columns)
world_cup_df.shape


(852, 20)

In [47]:
# Get column names
world_cup_df.columns  # text (string) columns are the ones listed with dtype 'object' in info()
print(len(world_cup_df.columns))

20


## Selecting Rows and Columns
Use `.iloc` and `.loc` for slicing and indexing.

In [30]:
# Select the rows at positions 3 to 5
world_cup_df.iloc[3:6]  # the stop position (6) is not included


Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
3,1930,14 Jul 1930 - 14:50,Group 3,Pocitos,Montevideo,Romania,3,1,Peru,,2549.0,1,0,WARNKEN Alberto (CHI),LANGENUS Jean (BEL),MATEUCCI Francisco (URU),201,1098,ROU,PER
4,1930,15 Jul 1930 - 16:00,Group 1,Parque Central,Montevideo,Argentina,1,0,France,,23409.0,0,0,REGO Gilberto (BRA),SAUCEDO Ulises (BOL),RADULESCU Constantin (ROU),201,1085,ARG,FRA
5,1930,16 Jul 1930 - 14:45,Group 1,Parque Central,Montevideo,Chile,3,0,Mexico,,9249.0,1,0,CRISTOPHE Henry (BEL),APHESTEGUY Martin (URU),LANGENUS Jean (BEL),201,1095,CHI,MEX


In [34]:
# Select rows labeled 5 to 9 (with .loc the stop label is included) and four columns
# df.loc[row_selector, column_selector]
world_cup_df.loc[5:9, ['Home Team Name', 'Away Team Name', 'Year', 'City']]

Unnamed: 0,Home Team Name,Away Team Name,Year,City
5,Chile,Mexico,1930,Montevideo
6,Yugoslavia,Bolivia,1930,Montevideo
7,USA,Paraguay,1930,Montevideo
8,Uruguay,Peru,1930,Montevideo
9,Chile,France,1930,Montevideo
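
The two cells above differ in how they treat the end of a slice: `.iloc[3:6]` stops before position 6, while `.loc[5:9]` includes label 9. A minimal sketch of that difference, using a made-up `toy_df` (a hypothetical frame, not the World Cup data) with the default integer index:

```python
import pandas as pd

# Hypothetical toy frame with the default index 0..5
toy_df = pd.DataFrame({"team": ["A", "B", "C", "D", "E", "F"],
                       "goals": [4, 3, 2, 3, 1, 3]})

by_position = toy_df.iloc[3:6]  # positions 3, 4, 5 (stop position 6 is excluded)
by_label = toy_df.loc[3:5]      # labels 3, 4, 5 (stop label 5 is included)

print(by_position.equals(by_label))                 # True
print(len(toy_df.iloc[0:2]), len(toy_df.loc[0:2]))  # 2 3
```

With the default index, positions and labels happen to coincide, which is why the same numbers select different row counts depending on the accessor.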


# loc vs. iloc in Pandas (Very Simple Explanation)
An index in Pandas is the set of labels that identifies the rows of a DataFrame (the labels are usually, but not necessarily, unique). It helps you quickly locate and access data.
When you create a DataFrame without specifying an index, Pandas automatically assigns one (row numbers starting from 0).

**iloc** is integer position-based.
- You use it when you want to select rows or columns by their position (like counting from 0, 1, 2, …).

**Example:**  
- To get the first row (regardless of its label), use `df.iloc[0]`.

**loc** is label-based.  
You use it when you want to select rows or columns by their names (the labels in the index or column headers). To select rows by a meaningful label, first make a specific column (like "Name") the index, for example with `set_index`.

**Example:**  
- If your DataFrame has a row labeled "Alice", you can get that row with `df.loc["Alice"]`.
- If "Name" is still a regular column, a boolean mask selects the same row: `df_Alice = df[df['Name'] == "Alice"]`.
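
As a rough illustration of the three access patterns described above, the sketch below builds a small made-up table (`df` and its values are hypothetical) and selects the same student by position, by boolean mask, and by label:

```python
import pandas as pd

# Hypothetical stand-in for a small students table
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                   "Age": [14, 15, 14],
                   "Grade": ["A", "B", "C"]})

first_row = df.iloc[0]                   # by position: works with any index
alice_mask = df[df["Name"] == "Alice"]   # boolean mask on a regular column (DataFrame)

indexed = df.set_index("Name")           # returns a new frame; df keeps its "Name" column
alice_row = indexed.loc["Alice"]         # by label (Series)

print(first_row["Name"], len(alice_mask), alice_row["Age"])  # Alice 1 14
```

Note that `df.loc["Alice"]` would raise a `KeyError` here, because "Alice" is a label only in `indexed`, not in the original `df`.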


Example of using `.loc` on a small students table:

In [35]:
students_df = pd.read_csv('students.csv')
students_df

Unnamed: 0,Name,Age,Grade
0,Alice,14,A
1,Bob,15,B
2,Charlie,14,C
3,David,15,A
4,Eva,14,B


In [None]:
students_df.iloc[0:5]  # the "i" in iloc stands for integer: rows are selected by position

Unnamed: 0,Name,Age,Grade
0,Alice,14,A
1,Bob,15,B
2,Charlie,14,C
3,David,15,A
4,Eva,14,B


In [41]:
# Make 'Name' the index so that rows can be selected by label
students_df.set_index('Name', inplace=True)

In [42]:
students_df.loc['David']

Age      15
Grade     A
Name: David, dtype: object

In [43]:
students_df.index

Index(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], dtype='object', name='Name')

## Boolean Masking
A boolean mask is a Series of True/False values, one per row; indexing a DataFrame with it keeps only the rows marked True.
- Combine several conditions with `&` (AND), `|` (OR), and `~` (NOT), wrapping each condition in parentheses.
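
A minimal sketch of the three operators, assuming a made-up `matches` frame (hypothetical values, not taken from the dataset). Storing each condition in a variable first makes it easier to see that a mask is just a Series of booleans:

```python
import pandas as pd

# Hypothetical toy data
matches = pd.DataFrame({
    "Year": [1950, 1950, 1954, 1954],
    "Stage": ["Group 3", "Final", "Group 3", "Final"],
    "Goals": [5, 3, 2, 5],
})

is_1950 = matches["Year"] == 1950     # boolean Series, one value per row
is_final = matches["Stage"] == "Final"

both = matches[is_1950 & is_final]    # AND: the 1950 final only
either = matches[is_1950 | is_final]  # OR: any 1950 match or any final
not_final = matches[~is_final]        # NOT: everything except finals

print(len(both), len(either), len(not_final))  # 1 3 2
```

When the conditions are written inline instead, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==`.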


In [46]:
# Example: Find all games in Group 3 for the 1950 World Cup
df_1950_group3 = world_cup_df[(world_cup_df['Year'] == 1950) & (world_cup_df['Stage'] == 'Group 3')]
df_1950_group3

# Display only the City column for these filtered rows
df_1950_group3['City']

56    Sao Paulo 
61     Curitiba 
65    Sao Paulo 
Name: City, dtype: object

## Creating and Modifying Columns
Here, we'll create a "Total Goals" column and show how to modify certain values.

In [48]:
# Create 'Total Goals' column
world_cup_df['Total Goals'] = world_cup_df['Home Team Goals'] + world_cup_df['Away Team Goals']

# Create 'Half-time Goals' column (sum of home and away half-time goals)
world_cup_df['Half-time Goals'] = world_cup_df['Half-time Home Goals'] + world_cup_df['Half-time Away Goals']

# Check updated columns
world_cup_df[['Home Team Name','Away Team Name','Total Goals','Half-time Goals']].head()

Unnamed: 0,Home Team Name,Away Team Name,Total Goals,Half-time Goals
0,France,Mexico,5,3
1,USA,Belgium,3,2
2,Yugoslavia,Brazil,3,2
3,Romania,Peru,4,1
4,Argentina,France,1,0


In [49]:
# Confirm that the two new columns were added
# (note the parentheses: a bare `world_cup_df.info` displays the bound method instead of the summary)
world_cup_df.info()

In [50]:
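# Example of modifying entries that contain 'Korea'
# Caution: each line overwrites every name containing 'Korea' in that column,
# so this demonstrates assignment through a mask, not an accurate relabeling of the two teams
world_cup_df.loc[world_cup_df['Home Team Name'].str.contains('Korea'), 'Home Team Name'] = 'North Korea'
world_cup_df.loc[world_cup_df['Away Team Name'].str.contains('Korea'), 'Away Team Name'] = 'South Korea'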

# Check updated entries
world_cup_df.loc[world_cup_df['Home Team Name'].str.contains('Korea'), ['Home Team Name']]


Unnamed: 0,Home Team Name
179,North Korea
187,North Korea
374,North Korea
386,North Korea
434,North Korea
444,North Korea
480,North Korea
524,North Korea
593,North Korea
609,North Korea


In [51]:
world_cup_df.loc[world_cup_df['Away Team Name'].str.contains('Korea'), ['Away Team Name']]

Unnamed: 0,Away Team Name
80,South Korea
88,South Korea
171,South Korea
195,South Korea
364,South Korea
421,South Korea
464,South Korea
490,South Korea
542,South Korea
556,South Korea
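
The substring-based assignment in the cells above gives every matching home team one label and every matching away team another, regardless of which team actually played. A more careful variant maps each exact name to its replacement. The sketch below is only an illustration on made-up rows; it assumes the raw names are 'Korea Republic' and 'Korea DPR', which should be checked against the real column values (for example with `.unique()`) before relying on it.

```python
import pandas as pd

# Hypothetical rows; the raw team names are an assumption about the dataset
teams = pd.DataFrame({
    "Home Team Name": ["Korea Republic", "Italy", "Korea DPR"],
    "Away Team Name": ["Italy", "Korea DPR", "Korea Republic"],
})

# Assumed mapping from raw name to display name
name_map = {"Korea Republic": "South Korea", "Korea DPR": "North Korea"}

# Series.replace with a dict swaps exact matches and leaves other values untouched
for col in ["Home Team Name", "Away Team Name"]:
    teams[col] = teams[col].replace(name_map)

print(teams)
```

Because the mapping is keyed on the full name, the same team gets the same label whether it appears in the home or the away column.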


In [53]:
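# City values appear to carry a trailing space (see 'Sao Paulo ' in the earlier output), hence 'Puebla '
df_Puebla = world_cup_df[world_cup_df['City'] == 'Puebla ']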
df_Puebla

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,...,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials,Total Goals,Half-time Goals
201,1970,02 Jun 1970 - 16:00,Group 2,Cuauhtemoc,Puebla,Uruguay,2,0,Israel,,...,0,DAVIDSON Bob (SCO),SCHEURER Ruedi (SUI),TAREKEGN Seyoum (ETH),250,1881,URU,ISR,2,1
208,1970,06 Jun 1970 - 16:00,Group 2,Cuauhtemoc,Puebla,Uruguay,0,0,Italy,,...,0,GLOECKNER Rudolf (GDR),TSCHENSCHER Kurt (GER),HORVAT Drago (YUG),250,1884,URU,ITA,0,0
216,1970,10 Jun 1970 - 16:00,Group 2,Cuauhtemoc,Puebla,Sweden,1,0,Uruguay,,...,0,LANDAUER Henry (USA),TAYLOR John (ENG),RADULESCU Andrei (ROU),250,1922,SWE,URU,1,0
372,1986,05 Jun 1986 - 12:00,Group A,Cuauhtemoc,Puebla,Italy,1,1,Argentina,,...,1,KEIZER Jan (NED),MARQUEZ RAMIREZ Antonio (MEX),SNODDY Alan (NIR),308,394,ITA,ARG,2,2
386,1986,10 Jun 1986 - 12:00,Group A,Cuauhtemoc,Puebla,North Korea,2,3,Italy,,...,1,SOCHA David (USA),URREA Joaquin (MEX),AL SHARIF Jamal (SYR),308,643,KOR,ITA,5,1
398,1986,16 Jun 1986 - 16:00,Round of 16,Cuauhtemoc,Puebla,Argentina,1,0,Uruguay,,...,0,AGNOLIN Luigi (ITA),COURTNEY George (ENG),SILVA VALENTE Carlos Alberto (POR),309,398,ARG,URU,1,1
406,1986,22 Jun 1986 - 16:00,Quarter-finals,Cuauhtemoc,Puebla,Spain,1,1,Belgium,Belgium win on penalties (4 - 5),...,0,KIRSCHEN Siegfried (GER),CODESAL MENDEZ Edgardo (MEX),BRUMMEIER Horst (AUT),714,421,ESP,BEL,2,0
410,1986,28 Jun 1986 - 12:00,Match for third place,Cuauhtemoc,Puebla,France,4,2,Belgium,France win after extra time,...,0,COURTNEY George (ENG),SILVA ARCE Hernan (CHI),AL SHARIF Jamal (SYR),3468,422,FRA,BEL,6,0
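
The filter above only matches because the literal `'Puebla '` includes the trailing space. One way to avoid depending on that detail is to strip whitespace from the column before comparing. A minimal sketch on made-up rows (the `cities` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows with the same kind of trailing space
cities = pd.DataFrame({"City": ["Puebla ", "Sao Paulo ", "Puebla "],
                       "Year": [1970, 1950, 1986]})

print(len(cities[cities["City"] == "Puebla"]))  # 0: the stored value is 'Puebla '

cities["City"] = cities["City"].str.strip()     # remove leading and trailing whitespace
puebla = cities[cities["City"] == "Puebla"]
print(len(puebla))                              # 2
```

Stripping once, right after loading, keeps later filters free of hidden whitespace.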


## Summary
In this notebook, we covered:
1. Loading data with Pandas.
2. Inspecting the dataset.
3. Selecting rows and columns.
4. Boolean filtering.
5. Creating and modifying columns.

Feel free to experiment with other Pandas methods to learn more!