## Data Manipulation 

Here is a sample of the final data, taken from the website Ballchasing. From my dataset I took a sample of 300 plays, which gives 600 player rows (two players per play). The 300 CSV files are stored in the same folder, and each one contains the statistics of the players in that play.

We start by adding a "result" column to each CSV file to mark the winner and the loser of the play. To do so we use the "goals" column: the player with the higher number of goals is the winner and the other player is the loser. This gives us 300 files, one per play, each with a "result" column.
We then merge them into a single dataset covering the 300 plays, still with the "result" column.
Finally, we clean the merged dataset.

### 1. Library imports

In [41]:
import pandas as pd
import os
from tqdm import tqdm

### 2. Data manipulation

#### 2.1. Creating the column result

Here we add the "result" column.

##### 2.1.1. Listing all the CSV files

In [42]:
# We create a separate folder that will hold the new CSV files with the added column.

folder_path = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\BallChasing_CSVs"
output_folder = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\BallChasing_CSVs_Final"
os.makedirs(output_folder, exist_ok=True)

# Now we list every CSV file in the input folder.

csv_files = [f for f in os.listdir(folder_path) if f.endswith(".csv")]

##### 2.1.2. Creating the column and saving the file

We add the "result" column and label as winner the player with the most goals.

In [43]:
# We create the column and show a progress bar to follow the run.

for file in tqdm(csv_files, desc="Result column is added"):
    file_path = os.path.join(folder_path, file)

    try:
        df = pd.read_csv(file_path, sep=";")
        
        # Find the highest number of goals in this play.
        
        max_goals = df["goals"].max()

        # We add the column: the player who reaches that number is the winner.
        
        df["result"] = df["goals"].apply(lambda x: "winner" if x == max_goals else "loser")
        
        # We save each CSV in the output folder.

        output_path = os.path.join(output_folder, file)
        df.to_csv(output_path, sep=";", index=False)

    except Exception as e:
        print(f"Error in {file} : {e}")

print(f"Column 'result' added in all the CSV : {output_folder}")


Result column is added: 100%|██████████| 300/300 [00:01<00:00, 154.00it/s]

Column 'result' added in all the CSV : C:\Users\Yacine\Downloads\Analyse de données L2\MVP\BallChasing_CSVs_Final
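
To see what the labelling rule does on a single play, here is a minimal sketch on a made-up two-player table. The helper name `add_result`, the "player name" column and the sample values are assumptions for illustration only. The sketch also shows that a tie on goals would label both players as winners, which may be worth checking for in the real files.

```python
import pandas as pd


def add_result(df: pd.DataFrame) -> pd.DataFrame:
    """Label the top scorer(s) of one play as 'winner', everyone else as 'loser'."""
    out = df.copy()
    max_goals = out["goals"].max()
    out["result"] = out["goals"].apply(
        lambda x: "winner" if x == max_goals else "loser"
    )
    return out


# Hypothetical solo play: two rows, one per player.
match = pd.DataFrame({"player name": ["A", "B"], "goals": [3, 1]})
labelled = add_result(match)
print(labelled)

# Hypothetical tie: both rows reach the maximum, so both become 'winner'.
tie = add_result(pd.DataFrame({"player name": ["C", "D"], "goals": [2, 2]}))
has_single_winner = (tie["result"] == "winner").sum() == 1
print("Exactly one winner:", has_single_winner)
```

If such a check ever fails on a real file, that play probably needs to be inspected by hand before it is merged.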





#### 2.2. Merging all the CSV files into one

Here we build our final CSV by merging the 300 CSV files into a single dataset.

##### 2.2.1. Listing all the CSV files

In [44]:
# Now we list every CSV file that has the result column.

input_folder = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\BallChasing_CSVs_Final"
output_file = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\Dataset_MVP.csv"
csv_files = [f for f in os.listdir(input_folder) if f.endswith(".csv")]

##### 2.2.2. Creating the final dataset

In [45]:
# We merge all the CSV files into one, with a progress bar to follow the run.

dataframes = []

for file in tqdm(csv_files, desc="CSV merging"):
    file_path = os.path.join(input_folder, file)
    try:
        df = pd.read_csv(file_path, sep=";")
        df["source_file"] = file  # We keep the name of the original file
        dataframes.append(df)
    except Exception as e:
        print(f"problem with {file} : {e}")  # To check if there is a problem

# Now we create the final dataset.

if dataframes:
    merged_df = pd.concat(dataframes, ignore_index=True)
    merged_df.to_csv(output_file, sep=";", index=False)
    print(f"The dataset is completed : {output_file}")
else:
    print("The merging didn't work") # To check if there is a problem



CSV merging: 100%|██████████| 300/300 [00:01<00:00, 295.51it/s]


The dataset is completed : C:\Users\Yacine\Downloads\Analyse de données L2\MVP\Dataset_MVP.csv
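
The merge step can be illustrated on a small scale. The sketch below is only an illustration under stated assumptions: the file names and values are made up, and the two checks at the end (two rows per play, exactly one winner per play) are a suggested sanity test, not something the notebook runs.

```python
import pandas as pd

# Hypothetical per-play tables standing in for the CSV files with a result column.
frames = {
    "match_1.csv": pd.DataFrame({"goals": [3, 1], "result": ["winner", "loser"]}),
    "match_2.csv": pd.DataFrame({"goals": [0, 2], "result": ["loser", "winner"]}),
}

parts = []
for name, df in frames.items():
    df = df.copy()
    df["source_file"] = name  # remember which file each row came from
    parts.append(df)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 per file.
merged = pd.concat(parts, ignore_index=True)

# Sanity checks: two players per play and exactly one winner per play.
rows_per_match = merged.groupby("source_file").size()
winners_per_match = merged["result"].eq("winner").groupby(merged["source_file"]).sum()
print(len(merged), rows_per_match.eq(2).all(), winners_per_match.eq(1).all())
```

On the real data, the same checks would be expected to report 600 rows for the 300 files, given two players per play.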


#### 2.3. Cleaning the final dataset

We clean the final dataset by dropping the columns we don't need: they only make sense for team plays, whereas we focus here on solo plays.

##### 2.3.1. Loading the CSV

In [46]:
input_file = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\Dataset_MVP.csv"
output_file = r"C:\Users\Yacine\Downloads\Analyse de données L2\MVP\Final_Dataset_MVP.csv"
df = pd.read_csv(input_file, sep=";")

##### 2.3.2. Dropping the columns

The columns we target are "team name", "assists" and "avg distance to team mates".

In [47]:
# Deleting the columns that we don't need.

cols_to_drop = ["team name", "assists", "avg distance to team mates"]
df.columns = [c.strip().lower() for c in df.columns]
cols_to_drop_normalized = [c.strip().lower() for c in cols_to_drop]
cols_to_drop_existing = [c for c in cols_to_drop_normalized if c in df.columns]
df = df.drop(columns=cols_to_drop_existing)
print(f"Columns deleted: {cols_to_drop_existing}")



Columns deleted: ['team name', 'assists', 'avg distance to team mates']
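
As a small illustration of the same cleaning idea, the sketch below normalises the headers of a made-up table and then drops the team-only columns. The sample headers and values are assumptions; `errors="ignore"` is an alternative to filtering the list by hand, since it tells pandas to skip names that are not present.

```python
import pandas as pd

# Hypothetical table with inconsistent header spacing and casing.
df = pd.DataFrame(
    {" Team Name": ["x", "y"], "goals": [3, 1], "Assists ": [0, 0]}
)

# Normalise headers the same way as the cell above: trim spaces, lowercase.
df.columns = [c.strip().lower() for c in df.columns]

cols_to_drop = ["team name", "assists", "avg distance to team mates"]

# errors="ignore" skips names that are absent instead of raising KeyError.
cleaned = df.drop(columns=cols_to_drop, errors="ignore")
print(list(cleaned.columns))
```

The explicit filtering used in the notebook has one advantage over this shortcut: it reports exactly which columns were actually removed.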


##### 2.3.3. Saving the final dataset

In [48]:
df.to_csv(output_file, sep=";", index=False)
print(f"Final dataset created: {output_file}")

Final dataset created: C:\Users\Yacine\Downloads\Analyse de données L2\MVP\Final_Dataset_MVP.csv


We now have the final dataset for the MVP. We will use this sample to check whether the project is feasible with the prediction models planned for the MVP.