# Data Cleaning and Preprocessing 1

This first data-cleaning script prepares the raw data for further analysis. The tutorial covers common preprocessing techniques using Python's built-in dictionaries and the pandas library. Preprocessing is a crucial step in data analysis and machine learning: it involves cleaning, transforming, and organizing data so that it is suitable for analysis.

<img src="https://s.yimg.com/ny/api/res/1.2/CnYKko4HHp.BhQPe2G_dEQ--/YXBwaWQ9aGlnaGxhbmRlcjt3PTk2MDtoPTY0MDtjZj13ZWJw/https://media.zenfs.com/en/the_independent_635/769602519fa97ace462f965de78b1d5d" alt="Image" width="500" height="300">

## Working with Files and Directories using the `os` Module

The `os` module gives Python access to operating-system-dependent functionality, such as reading from and writing to the file system. It lets you interact with the file system in a largely platform-independent way.

### Importing the `os` Module

Before we can use the `os` module, we need to import it. This is done using the following statement:

```python
import os
```


### Checking if a File Exists

To check whether a path exists, use the `os.path.exists()` function. It returns `True` for directories as well as files; use `os.path.isfile()` when you specifically need a regular file.

```python
file_path = "/path/to/your/file.txt"
if os.path.exists(file_path):
    print(f"The file {file_path} exists.")
else:
    print(f"The file {file_path} does not exist.")
```



### Getting File Information

You can retrieve information about a file, such as its size and timestamps, using `os.path.getsize()`, `os.path.getctime()`, and `os.path.getmtime()`. Note that `os.path.getctime()` returns the creation time on Windows but the time of the last metadata change on Unix-like systems. Both timestamps are returned as seconds since the epoch.

```python
file_path = "/path/to/your/file.txt"
size = os.path.getsize(file_path)
ctime = os.path.getctime(file_path)  # creation time on Windows, metadata change time on Unix
mtime = os.path.getmtime(file_path)

print(f"Size of the file: {size} bytes")
print(f"Creation (or metadata change) time: {ctime}")
print(f"Last modification time: {mtime}")
```

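The timestamps above are plain floats, which are hard to read. A minimal sketch of converting one into a readable date follows; the temporary file is only there so the example runs anywhere and is not part of the original workflow.

```python
import os
import tempfile
from datetime import datetime

# Create a throwaway file so the example does not depend on a real path.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"hello")
    file_path = tmp.name

size = os.path.getsize(file_path)
# getmtime() returns seconds since the epoch; convert it for display.
modified = datetime.fromtimestamp(os.path.getmtime(file_path))

print(f"Size: {size} bytes")
print(f"Last modified: {modified:%Y-%m-%d %H:%M:%S}")

os.remove(file_path)  # clean up the temporary file
```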

### Creating a Directory

You can create a new directory using the `os.mkdir()` function.

```python
dir_path = "/path/to/your/new_directory"
os.mkdir(dir_path)
```

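`os.mkdir()` raises `FileExistsError` if the directory already exists and `FileNotFoundError` if a parent directory is missing. When either case is likely, `os.makedirs()` with `exist_ok=True` is a common alternative. The sketch below uses a temporary directory and made-up folder names purely for illustration.

```python
import os
import tempfile

# Work inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "data", "json")  # hypothetical folder names

    # makedirs() creates both "data" and "json" in one call.
    os.makedirs(nested, exist_ok=True)
    # A second call is a no-op instead of raising FileExistsError.
    os.makedirs(nested, exist_ok=True)

    created = os.path.isdir(nested)
    print(f"Directory created: {created}")
```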

### Listing Files in a Directory

To get a list of all the files and directories in a directory, you can use `os.listdir()`.

```python
dir_path = "/path/to/your/directory"
contents = os.listdir(dir_path)
print(f"Contents of {dir_path}: {contents}")
```

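Because `os.listdir()` returns files and subdirectories together, and in arbitrary order, it is often useful to sort the result and keep only regular files. A small sketch, using a temporary directory and hypothetical entry names:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dir_path:
    # Set up one file and one subdirectory (names are illustrative only).
    open(os.path.join(dir_path, "notes.txt"), "w").close()
    os.mkdir(os.path.join(dir_path, "archive"))

    # Sort for a reproducible order, then keep regular files only.
    entries = sorted(os.listdir(dir_path))
    files_only = [
        name for name in entries
        if os.path.isfile(os.path.join(dir_path, name))
    ]

    print(entries)
    print(files_only)
```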

### Joining Paths

You can join two or more path components using `os.path.join()`.

```python
path1 = "/path/to/your"
path2 = "directory"
full_path = os.path.join(path1, path2)
print(f"Full path: {full_path}")
```


# Let's start!

```python
# Imports
import os
import pandas as pd
import json
import numpy as np
```

```python
# Define the path to the folder containing JSON files
folder_path = r"D:\data\json"  # Replace with the actual path to folder

# Initialize an empty list to store file names
files = []

# Iterate through files in the specified folder
for file_name in os.listdir(folder_path):
    # Check if the file starts with "team_" and ends with ".json"
    if file_name.startswith("team_") and file_name.endswith(".json"):
        files.append(file_name)  # Add the file name to the list
```

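The same filter can be written as a wildcard pattern with the standard `glob` module. This is an alternative sketch, not the approach used in the rest of this notebook; the file names below are invented stand-ins for the real data folder. Sorting the result also makes positional lookups such as `files[10]` reproducible, since directory listings come back in arbitrary order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder_path:
    # Hypothetical file names standing in for the real data folder.
    for name in ["team_A_1.json", "team_B_2.json", "player_C_3.json", "team_D.txt"]:
        open(os.path.join(folder_path, name), "w").close()

    # "team_*.json" matches names that start with "team_" and end with ".json".
    pattern = os.path.join(folder_path, "team_*.json")
    files = sorted(os.path.basename(p) for p in glob.glob(pattern))
    print(files)
```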

```python
file = os.path.join(folder_path, files[10])
file
```

```
'D:\\data\\json\\team_Afghanistan_102115.json'
```

```python
with open(file, 'r') as json_file:
    data = json.load(json_file)
```

```python
match = {
    "date": data['date'],
    "home": data['home'],
    "place": data['place'],
    "result": data['result'],
}
match
```

```
{'date': '06/04/2023 04:30:00 AM',
 'home': 'Sri Lanka',
 'place': 'Mahinda Rajapaksa International Cricket Stadium',
 'result': 'Sri Lanka won by 132 runs'}
```

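The `date` field is stored as a string. If it needs to be sorted or filtered later, it can be parsed into a `datetime` object. The sketch below assumes the format is month/day/year with a 12-hour clock; the source data does not state this, so verify it against a few known matches before relying on it.

```python
from datetime import datetime

match = {"date": "06/04/2023 04:30:00 AM"}  # value copied from the output above

# Assumed format: month/day/year with a 12-hour clock (not confirmed by the data source).
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"
parsed = datetime.strptime(match["date"], DATE_FORMAT)

print(parsed.isoformat())
```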
### We will be evaluating batting by:

#### Batting Average (BA):

Formula: Total Runs / Total Wickets

Description: This metric gives you an average score per wicket. It is a good indicator of how well a player is performing in terms of consistency.

Example:

Batting Average = 103 runs / 1 wicket = 103


#### Strike Rate (SR):

Formula: (Total Runs / Total Balls Faced) * 100

Description: This metric reflects how quickly a player scores runs. A higher strike rate indicates more aggressive and faster scoring.

Example:

Strike Rate = (103 runs / 127 balls) * 100 ≈ 81.10


#### Boundary Percentage (BP):

Formula: ((4 * Total Fours + 6 * Total Sixes) / Total Runs) * 100

Description: This metric measures the percentage of runs scored through boundaries (fours and sixes). It indicates the player's ability to find the gaps and hit boundaries.

Example:

Boundary Percentage = ((4 * 6 + 6 * 3) / 103) * 100 ≈ 40.78%

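The three batting metrics can be checked with a few lines of Python. This is a minimal sketch using the example figures from this section (103 runs, 1 wicket, 127 balls, 6 fours, 3 sixes). The function names are illustrative, and the boundary percentage weights fours by 4 and sixes by 6 runs, as the `battingvars` code in this notebook does.

```python
def batting_average(runs, wickets):
    """Runs scored per wicket lost."""
    return runs / wickets


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced."""
    return runs / balls * 100


def boundary_percentage(fours, sixes, runs):
    """Share of runs that came from fours and sixes, in percent."""
    return (4 * fours + 6 * sixes) / runs * 100


print(batting_average(103, 1))
print(round(strike_rate(103, 127), 2))
print(round(boundary_percentage(6, 3, 103), 2))
```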


```python
def battingvars(data):
    # Initialize an empty list to store each batter's boundary percentage (BBP)
    BBP = []

    # Loop through each batting record in the data
    for i in data['batting']:
        # Check if 'Runs' is a valid value (not '-')
        if i['Runs'] == '-':
            continue  # Skip this record if 'Runs' is not valid
        if int(i['Runs']) > 0:
            # Calculate BBP and append to the list
            BBP.append(((4 * int(i['Fours']) + 6 * int(i['Sixes'])) / int(i['Runs'])) * 100)
        else:
            BBP.append(0)  # If 'Runs' is 0, set BBP to 0

    # Initialize an empty list to store Strike Rates (sr)
    sr = []

    # Loop through each batting record in the data
    for i in data['batting']:
        # Check if 'Strike Rate' is a valid value (not '-')
        if i['Strike Rate'] == '-':
            continue  # Skip this record if 'Strike Rate' is not valid
        else:
            # Convert 'Strike Rate' to float and append to the list
            sr.append(float(i['Strike Rate']))

    # Calculate bata (ratio of runs scored to wickets lost)
    bata = float(data['score'].split("/")[0]) / float(data['score'].split("/")[1])

    # Return a list containing bata, mean of sr, and mean of BBP
    return [bata, np.mean(sr), np.mean(BBP)]
```


```python
battingvars(data["Team1"])
```

```
[53.833333333333336, 118.71375, 46.23984864780165]
```

### For bowling we will use:

#### Bowling Average (BA):

Formula: Total Runs Conceded / Total Wickets Taken

Description: This metric gives you an average number of runs conceded per wicket taken. A lower bowling average indicates a more effective bowler.

Example:

Bowling Average = 32 runs / 4 wickets = 8.00


#### Economy Rate (ER):

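Formula: Total Runs Conceded / Total Overs Bowled

Description: This metric reflects how many runs a bowler concedes on average in an over. A lower economy rate is generally better. Overs are written as `overs.balls`, so 9.4 overs means 9 overs and 4 balls (58 balls, or about 9.67 overs), not 9.4 in decimal.

Example:

Economy Rate = 32 runs / (58 balls / 6) ≈ 3.31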


#### Strike Rate (SR):

Formula: Total Balls Bowled / Total Wickets Taken

Description: This metric indicates how many balls a bowler takes on average to take a wicket. A lower strike rate is usually better.

Example:

Strike Rate = 56 balls / 4 wickets = 14.00


#### Dot Ball Percentage (DBP):

Formula: (Total Dots Bowled / Total Balls Bowled) * 100

Description: This metric measures the percentage of deliveries that didn't result in any runs. It reflects the bowler's ability to build pressure on the batsmen.

Example:

Dot Ball Percentage = (39 dots / 56 balls) * 100 ≈ 69.64%

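The bowling metrics can be verified the same way. The sketch below uses illustrative function names and one consistent set of figures: 32 runs, 4 wickets, 39 dot balls, and the 56 balls (9 overs and 2 balls) from the strike rate example. Economy rate is computed from the ball count so that the `overs.balls` notation is never treated as a decimal number.

```python
def bowling_average(runs, wickets):
    """Runs conceded per wicket taken."""
    return runs / wickets


def economy_rate(runs, balls):
    """Runs conceded per six-ball over."""
    return runs / (balls / 6)


def bowling_strike_rate(balls, wickets):
    """Balls bowled per wicket taken."""
    return balls / wickets


def dot_ball_percentage(dots, balls):
    """Share of deliveries that conceded no runs, in percent."""
    return dots / balls * 100


runs, wickets, balls, dots = 32, 4, 56, 39

print(bowling_average(runs, wickets))
print(round(economy_rate(runs, balls), 2))
print(bowling_strike_rate(balls, wickets))
print(round(dot_ball_percentage(dots, balls), 2))
```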
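```python
def overs_to_balls(overs):
    # Overs use "overs.balls" notation: 8.4 means 8 overs and 4 balls
    overs = float(overs)
    # round() guards against floating-point error in the fractional part
    balls = int(overs) * 6 + int(round((overs % 1) * 10, 2))
    return balls

overs = '8.4'
balls = overs_to_balls(overs)

print(f'{overs} overs is equal to {balls} balls.')
```

```
8.4 overs is equal to 52 balls.
```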

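Going through `float` is fragile because values such as 2.3 cannot be represented exactly: `2.3 % 1` evaluates to `0.2999999999999998`, so truncating instead of rounding would lose a ball. A hedged alternative is to split the string and never touch floating point at all. The function name `overs_to_balls_str` is hypothetical and not used elsewhere in this notebook.

```python
def overs_to_balls_str(overs):
    """Convert 'overs.balls' notation (e.g. '8.4') to balls without float math."""
    whole, _, part = str(overs).partition(".")
    extra = int(part) if part else 0
    # A valid over has at most six balls, so the part after the dot is 0 to 5.
    if not 0 <= extra <= 5:
        raise ValueError(f"invalid ball count in overs value: {overs!r}")
    return int(whole) * 6 + extra


for value in ["8.4", "2.3", "10", "0.5"]:
    print(value, overs_to_balls_str(value))
```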

```python
def ballingvars(data):
    # Initialize an empty list to store calculated Dot Ball Percentage (DBP)
    DBP = []

    # Loop through each bowling record in the data
    for i in data['balling']:
        # Check if 'Overs' is a valid value (not '-')
        if i['Overs'] == '-':
            continue  # Skip this record if 'Overs' is not valid
        else:
            balls = overs_to_balls(i['Overs'])  # Convert overs to balls
            # Calculate DBP and append to the list
            DBP.append((int(i['Dots']) / balls) * 100)

    # Initialize an empty list to store Bowling Strike Rates (ballsr)
    ballsr = []

    # Loop through each bowling record in the data
    for i in data['balling']:
        # Check if 'Overs' is a valid value (not '-')
        if i['Overs'] == '-':
            continue  # Skip this record if 'Overs' is not valid
        else:
            balls = overs_to_balls(i['Overs'])  # Convert overs to balls
            try:
                # Calculate Bowling Strike Rate and append to the list
                ballsr.append(float(balls / int(i['Wickets'])))
            except (ValueError, ZeroDivisionError):
                continue  # Skip bowlers with no wickets or an invalid value

    # Initialize an empty list to store Economy Rates (eco)
    eco = []

    # Loop through each bowling record in the data
    for i in data['balling']:
        # Check if 'Economy' is a valid value (not '-')
        if i['Economy'] == '-':
            continue  # Skip this record if 'Economy' is not valid
        else:
            # Convert 'Economy' to float and append to the list
            eco.append(float(i['Economy']))

    # Initialize an empty list to store Bowling Average (balla)
    balla = []

    # Loop through each bowling record in the data
    for i in data['balling']:
        # Check if 'Wickets' is a valid value (not '-')
        if i['Wickets'] == '-':
            continue  # Skip this record if 'Wickets' is not valid
        else:
            try:
                # Calculate Bowling Average and append to the list
                balla.append(int(i['Runs']) / int(i['Wickets']))
            except (ValueError, ZeroDivisionError):
                pass  # Skip bowlers with no wickets or an invalid value

    # Return a list containing mean of balla, mean of eco, mean of ballsr, and mean of DBP
    return [np.mean(balla), np.mean(eco), np.mean(ballsr), np.mean(DBP)]
```


```python
# <------------- Let's write them together and transform the data ------------->


def battingvars(data):
    # A missing batting card is stored as the string 'None'
    if data['batting'] == 'None':
        return ['', '', '']
    # Boundary percentage per batter
    BBP = []
    for i in data['batting']:
        if i['Runs'] == '-':
            continue
        if int(i['Runs']) > 0:
            BBP.append(((4 * int(i['Fours']) + 6 * int(i['Sixes'])) / int(i['Runs'])) * 100)
        else:
            BBP.append(0)
    # Strike rate per batter
    sr = []
    for i in data['batting']:
        if i['Strike Rate'] == '-':
            continue
        sr.append(float(i['Strike Rate']))
    # Team batting average: runs scored per wicket lost
    try:
        bata = float(data['score'].split("/")[0]) / float(data['score'].split("/")[1])
    except IndexError:
        # The score has no "/wickets" part, so fall back to 10 wickets
        bata = float(data['score'].split("/")[0]) / 10
    except ZeroDivisionError:
        bata = 0
    return [bata, np.mean(sr), np.mean(BBP)]


def overs_to_balls(overs):
    # Overs use "overs.balls" notation: 8.4 means 8 overs and 4 balls
    overs = float(overs)
    # round() is required here: 2.3 % 1 is 0.2999999999999998, which int() alone would truncate to 2 balls
    balls = int(overs) * 6 + int(round((overs % 1) * 10, 2))
    return balls


def ballingvars(data):
    if data['balling'] == []:
        return ['', '', '', '']
    # Dot ball percentage per bowler
    DBP = []
    for i in data['balling']:
        if i['Overs'] == '-':
            continue
        balls = overs_to_balls(i['Overs'])
        DBP.append((int(i['Dots']) / balls) * 100)
    # Bowling strike rate per bowler
    ballsr = []
    for i in data['balling']:
        if i['Overs'] == '-':
            continue
        balls = overs_to_balls(i['Overs'])
        try:
            ballsr.append(float(balls / int(i['Wickets'])))
        except (ValueError, ZeroDivisionError):
            continue  # Skip bowlers with no wickets or an invalid value
    # Economy rate per bowler
    eco = []
    for i in data['balling']:
        if i['Economy'] == '-':
            continue
        eco.append(float(i['Economy']))
    # Bowling average per bowler
    balla = []
    for i in data['balling']:
        if i['Wickets'] == '-':
            continue
        try:
            balla.append(int(i['Runs']) / int(i['Wickets']))
        except (ValueError, ZeroDivisionError):
            pass  # Skip bowlers with no wickets or an invalid value
    return [np.mean(balla), np.mean(eco), np.mean(ballsr), np.mean(DBP)]


def maketeamdata(data):
    # Create a dictionary 'match1' with relevant fields from 'data'
    match1 = {
        "team_name": data['Name'],
        "score": data['score'],
        "overs": data['overs'],
        "rr": data['rr'],
        "extras": data['Extras'],
    }

    # Calculate batting statistics using 'battingvars' and update 'match1'
    bat = battingvars(data)
    match1["bata"] = bat[0]  # Batting Average
    match1["batsr"] = bat[1]  # Batting Strike Rate
    match1["bbp"] = bat[2]  # Boundary Percentage

    # Calculate bowling statistics using 'ballingvars' and update 'match1'
    ball = ballingvars(data)
    match1["balla"] = ball[0]  # Bowling Average
    match1["eco"] = ball[1]  # Economy Rate
    match1["ballsr"] = ball[2]  # Bowling Strike Rate
    match1["dbp"] = ball[3]  # Dot Ball Percentage

    return match1  # Return the updated 'match1' dictionary


# <------------------------------ Trigger code ------------------------------>

folder_path = r"D:\data\json"  # Replace with the actual path to files
files = []
for file_name in os.listdir(folder_path):
    if file_name.startswith("team_") and file_name.endswith(".json"):
        files.append(file_name)

# Results for which the match is skipped
skipped_results = [
    'match abandoned',
    'match abandoned without a ball bowled',
    'no result',
    'match cancelled without a ball bowled',
    'match postponed without a ball bowled',
]

# Initialize an empty list to store match data
lis = []

# Loop through each file in 'files'
for fileadd in files:
    # Get the full file path
    file = os.path.join(folder_path, fileadd)

    # Open and read the JSON file
    with open(file, 'r') as json_file:
        data = json.load(json_file)

    # Keep the match only if its result is not one of the skipped cases
    if data['result'].lower() not in skipped_results:
        # Fields shared by both teams' rows
        match = {
            "date": data['date'],
            "home": data['home'],
            "place": data['place'],
            "result": data['result'],
            "match_id": data['match_id'],
        }

        # Append one row per team: shared match fields plus that team's statistics
        for team in ['Team1', 'Team2']:
            lis.append({**match, **maketeamdata(data[team])})
```

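Running this cell prints NumPy `RuntimeWarning` messages. They come from `np.mean()` being called on an empty list (for example, when no bowler in an innings took a wicket), in which case the result is `nan`.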

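`np.mean()` returns `nan` and emits a `RuntimeWarning` when it is given an empty list, which happens here whenever every record in an innings is skipped. One way to get the same `nan` without the warning is a small guard function. `safe_mean` is a hypothetical helper sketched for illustration, not part of the original script.

```python
import math


def safe_mean(values):
    """Return the arithmetic mean, or NaN (without a warning) for an empty list."""
    if not values:
        return math.nan
    return sum(values) / len(values)


print(safe_mean([12.0, 20.0]))
print(safe_mean([]))
```

Swapping it in for `np.mean` inside `battingvars` and `ballingvars` would leave the resulting values unchanged while keeping the notebook output clean.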

```python
df = pd.DataFrame(lis)  # Build a DataFrame with one row per team per match

df.to_csv(r"D:\data\odi2023.csv")  # Save to disk
```