**1. Import all the required Python libraries.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).**

In [2]:
file_path = "/kaggle/input/fifa-2018/FIFA - 2018.csv"

**3. Load the dataset into a pandas DataFrame.**

In [3]:
df = pd.read_csv(file_path)
df.head()

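|   | Position | Team | Games Played | Win | Draw | Loss | Goals For | Goals Against | Goal Difference | Points |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | France | 7 | 6 | 1 | 0 | 14 | 6 | 8 | 19 |
| 1 | 2 | Croatia | 7 | 4 | 2 | 1 | 14 | 9 | 5 | 14 |
| 2 | 3 | Belgium | 7 | 6 | 0 | 1 | 16 | 6 | 10 | 18 |
| 3 | 4 | England | 7 | 3 | 1 | 3 | 12 | 8 | 4 | 10 |
| 4 | 5 | Uruguay | 5 | 4 | 0 | 1 | 7 | 3 | 4 | 12 |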


In [4]:
df.shape

(32, 10)

In [5]:
df.columns

Index(['Position', 'Team', 'Games Played', 'Win', 'Draw', 'Loss', 'Goals For',
       'Goals Against', 'Goal Difference', 'Points'],
      dtype='object')

**4. Data Preprocessing: Check for missing values using pandas isnull(), and use describe() to get some initial statistics. Provide variable descriptions and variable types, and check the dimensions of the data frame.**

In [6]:
df.isnull().sum()

Position           0
Team               0
Games Played       0
Win                0
Draw               0
Loss               0
Goals For          0
Goals Against      0
Goal Difference    0
Points             0
dtype: int64
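The per-column counts above can also be reported alongside percentages, which is easier to read when a dataset does contain gaps. The following is a minimal sketch on a small made-up frame (the `sample` data and the `missing_summary` name are illustrative assumptions, not part of the FIFA dataset); in the notebook the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame with a few gaps; in the notebook this would be `df`.
sample = pd.DataFrame({
    "Team": ["A", "B", None, "D"],
    "Points": [3, None, 1, 0],
})

# isnull().sum() counts missing cells per column;
# isnull().mean() gives the fraction missing, scaled here to a percentage.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,
})
print(missing_summary)
```

For the FIFA table every count is zero, so no imputation or row dropping is needed.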

In [7]:
df.describe()

|   | Position | Games Played | Win | Draw | Loss | Goals For | Goals Against | Points |
|---|---|---|---|---|---|---|---|---|
| count | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
| mean | 16.5 | 4.0 | 1.59375 | 0.8125 | 1.59375 | 5.28125 | 5.28125 | 5.59375 |
| std | 9.380832 | 1.344043 | 1.583369 | 0.895779 | 0.797552 | 4.065864 | 2.203141 | 4.804966 |
| min | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 |
| 25% | 8.75 | 3.0 | 1.0 | 0.0 | 1.0 | 2.0 | 4.0 | 3.0 |
| 50% | 16.5 | 3.5 | 1.0 | 1.0 | 2.0 | 3.5 | 5.0 | 4.0 |
| 75% | 24.25 | 4.25 | 2.0 | 1.0 | 2.0 | 6.25 | 6.25 | 7.25 |
| max | 32.0 | 7.0 | 6.0 | 3.0 | 3.0 | 16.0 | 11.0 | 19.0 |
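Since this step also asks for variable descriptions and types, one possible way to collect them in a single table is sketched below. It uses three rows and three columns taken from the `df.head()` output shown earlier; the `overview` frame and its column names are illustrative assumptions, and the same expressions could be run on the full `df`.

```python
import pandas as pd

# First three rows of the head() output, reduced to three columns.
sample = pd.DataFrame({
    "Position": [1, 2, 3],
    "Team": ["France", "Croatia", "Belgium"],
    "Points": [19, 14, 18],
})

# One row per variable: storage type, number of distinct values, and an example.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "example": sample.iloc[0],
})
print(overview)

# Dimensions as (rows, columns).
print(sample.shape)
```

Note that `describe()` only summarizes numeric columns by default, so text columns such as `Team` do not appear in its output unless `include='all'` or `include='object'` is passed.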


**5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.**

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Position         32 non-null     int64 
 1   Team             32 non-null     object
 2   Games Played     32 non-null     int64 
 3   Win              32 non-null     int64 
 4   Draw             32 non-null     int64 
 5   Loss             32 non-null     int64 
 6   Goals For        32 non-null     int64 
 7   Goals Against    32 non-null     int64 
 8   Goal Difference  32 non-null     object
 9   Points           32 non-null     int64 
dtypes: int64(8), object(2)
memory usage: 2.6+ KB


In [9]:
df.dtypes


Position            int64
Team               object
Games Played        int64
Win                 int64
Draw                int64
Loss                int64
Goals For           int64
Goals Against       int64
Goal Difference    object
Points              int64
dtype: object

In [10]:
df['Goal Difference'] = (
    df['Goal Difference']
    .str.replace('−', '-', regex=False)   # Unicode minus → ASCII minus
    .str.replace('+', '', regex=False)    # Remove plus sign
    .astype(int)
)
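The cell above works when every entry is a clean signed integer string. If the raw column might also contain stray whitespace or non-numeric placeholders, a more defensive variant is to go through `pd.to_numeric` with `errors="coerce"`. The sketch below uses made-up values (the `raw` series is a hypothetical illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical raw values illustrating the formats handled here.
raw = pd.Series(["+8", "\u22125", "0", " +4 ", "n/a"])

cleaned = (
    raw.str.strip()                              # drop surrounding whitespace
       .str.replace("\u2212", "-", regex=False)  # Unicode minus (U+2212) to ASCII minus
       .str.replace("+", "", regex=False)        # drop explicit plus sign
)

# errors="coerce" turns unparseable strings into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")

# The nullable Int64 dtype keeps integers while still allowing missing values.
as_int = numeric.astype("Int64")
print(as_int)
```

Any value that ends up missing after coercion should be inspected before further analysis, since it signals an entry the cleaning rules did not anticipate.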


In [11]:
df.dtypes

Position            int64
Team               object
Games Played        int64
Win                 int64
Draw                int64
Loss                int64
Goals For           int64
Goals Against       int64
Goal Difference     int64
Points              int64
dtype: object
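With every numeric column now stored as an integer, the derived columns can be cross-checked against the columns they come from. The sketch below rebuilds the five rows displayed by `df.head()` and tests three relationships. The points rule (3 per win, 1 per draw) is an assumption about how the table was compiled, although it agrees with the rows shown.

```python
import pandas as pd

# The five rows displayed by df.head() earlier in the notebook.
sample = pd.DataFrame({
    "Team": ["France", "Croatia", "Belgium", "England", "Uruguay"],
    "Games Played": [7, 7, 7, 7, 5],
    "Win": [6, 4, 6, 3, 4],
    "Draw": [1, 2, 0, 1, 0],
    "Loss": [0, 1, 1, 3, 1],
    "Goals For": [14, 14, 16, 12, 7],
    "Goals Against": [6, 9, 6, 8, 3],
    "Goal Difference": [8, 5, 10, 4, 4],
    "Points": [19, 14, 18, 10, 12],
})

# Each check yields a boolean Series, one entry per team.
games_ok = sample["Win"] + sample["Draw"] + sample["Loss"] == sample["Games Played"]
gd_ok = sample["Goals For"] - sample["Goals Against"] == sample["Goal Difference"]
points_ok = 3 * sample["Win"] + sample["Draw"] == sample["Points"]  # assumed scoring rule

# Rows that fail any check would be listed here.
print(sample.loc[~(games_ok & gd_ok & points_ok), "Team"].tolist())
```

Running the same three comparisons on the full `df` is a quick way to catch conversion mistakes, such as a sign lost while parsing `Goal Difference`.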

**6. Turn categorical variables into quantitative variables in Python.**

In [12]:
df.select_dtypes(include='object').columns

Index(['Team'], dtype='object')

In [13]:
df['Team_Code'] = df['Team'].astype('category').cat.codes

In [14]:
df[['Team', 'Team_Code']].head()

|   | Team | Team_Code |
|---|---|---|
| 0 | France | 10 |
| 1 | Croatia | 6 |
| 2 | Belgium | 2 |
| 3 | England | 9 |
| 4 | Uruguay | 31 |
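The codes produced by `cat.codes` follow the sorted order of the category labels, so here they reflect alphabetical order of team names rather than any footballing quantity. When the encoded column feeds a model that treats numbers as magnitudes, one-hot encoding is a common alternative. The sketch below shows both on five team names only, so the codes differ from the 32-team values above; `code_to_team` and `one_hot` are illustrative names.

```python
import pandas as pd

teams = pd.Series(
    ["France", "Croatia", "Belgium", "England", "Uruguay"], name="Team"
)

# Label encoding: integer codes assigned in sorted (alphabetical) category order.
cat = teams.astype("category")
codes = cat.cat.codes

# Lookup table from integer code back to team name.
code_to_team = dict(enumerate(cat.cat.categories))

# One-hot encoding: one 0/1 indicator column per team, with no implied ordering.
one_hot = pd.get_dummies(teams, prefix="Team", dtype=int)

print(codes.tolist())
print(code_to_team)
print(one_hot.columns.tolist())
```

Label codes keep the frame compact, while one-hot columns avoid suggesting that one team is "greater" than another; which to use depends on the downstream analysis.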
