
ProportionBySport by ID (and not by Name) #115

@mli42

Description

  • Day: 04
  • Exercise: 02

In this exercise, there is a special hint:

Hint: here and further, if needed, drop duplicated sportspeople to count only unique ones.
Beware to call the dropping function at the right moment and with the right parameters, in order not to omit any individuals.

I eventually worked out that the result shown in the exercise's example is obtained by dropping people who have the same name.
However, I think duplicated people should be identified by their IDs rather than their names, since two different athletes can share a name.
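To illustrate why the two approaches can disagree, here is a minimal sketch on a made-up table. The rows and names are hypothetical and are not taken from athlete_events.csv; only the column names `ID`, `Name` and `Sport` match the dataset used in the exercise.

```python
import pandas as pd

# Hypothetical data: IDs 1 and 2 are two different people with the same name,
# and ID 3 appears twice because that athlete entered two events.
toy = pd.DataFrame({
    "ID":    [1, 2, 3, 3, 4],
    "Name":  ["A. Smith", "A. Smith", "B. Jones", "B. Jones", "C. Lee"],
    "Sport": ["Tennis", "Swimming", "Swimming", "Swimming", "Tennis"],
})

# Deduplicating by ID keeps both "A. Smith" athletes.
by_id = toy.drop_duplicates(subset="ID")

# Deduplicating by Name merges them into one row, losing an individual.
by_name = toy.drop_duplicates(subset="Name")

# Share of unique athletes who play tennis under each rule.
prop_by_id = (by_id["Sport"] == "Tennis").mean()
prop_by_name = (by_name["Sport"] == "Tennis").mean()

print(len(by_id), prop_by_id)
print(len(by_name), prop_by_name)
```

In this toy case the name-based version counts three people instead of four, so the denominator shrinks and the proportion changes.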

Examples

import pandas as pd
from FileLoader import FileLoader

def proportionBySport_ID(df: pd.DataFrame, yr: int, sport: str, gdr: str) -> float:
    df = df[(df["Year"]==yr) & (df["Sex"]==gdr)]
    df = df[~df.duplicated(subset=["ID"])] # <-- By ID
    df_res = df[df["Sport"]==sport]
    return (df_res.shape[0] / df.shape[0])

def proportionBySport_Name(df: pd.DataFrame, yr: int, sport: str, gdr: str) -> float:
    df = df[(df["Year"]==yr) & (df["Sex"]==gdr)]
    df = df[~df.duplicated(subset=["Name"])] # <-- By Name
    df_res = df[df["Sport"]==sport]
    return (df_res.shape[0] / df.shape[0])

if __name__ == "__main__":
    loader = FileLoader()
    data = loader.load('./resources/athlete_events.csv')
    print(proportionBySport_ID(data, 2004, 'Tennis', 'F')) # 0.019302325581395347
    print(proportionBySport_Name(data, 2004, 'Tennis', 'F')) # 0.01935634328358209
    print(0.01935634328358209, "-> Example's result")

==> prints:

Loading dataset of dimensions 271116 x 15
0.019302325581395347
0.01935634328358209
0.01935634328358209 -> Example's result
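One way to check where the gap between the two results comes from would be to list the names that map to more than one ID in the filtered subset. The helper below is only a sketch: `names_shared_by_several_ids` is a hypothetical name, and the demo frame is made up, reusing the `Year`, `Sex`, `ID` and `Name` columns from the code above.

```python
import pandas as pd

def names_shared_by_several_ids(df: pd.DataFrame, yr: int, gdr: str) -> pd.Series:
    """Return, for one year and sex, the names attached to more than one ID."""
    subset = df[(df["Year"] == yr) & (df["Sex"] == gdr)]
    # Count the distinct IDs behind each name.
    ids_per_name = subset.groupby("Name")["ID"].nunique()
    # Keep only the names that would be merged by a name-based deduplication.
    return ids_per_name[ids_per_name > 1]

# Made-up rows for demonstration only.
demo = pd.DataFrame({
    "ID":   [1, 2, 3, 3, 5],
    "Name": ["A. Smith", "A. Smith", "B. Jones", "B. Jones", "A. Smith"],
    "Sex":  ["F", "F", "F", "F", "M"],
    "Year": [2004, 2004, 2004, 2004, 2004],
})

shared = names_shared_by_several_ids(demo, 2004, "F")
print(shared)
```

Running the same helper on the real dataset with `2004` and `'F'` should show which individuals are omitted when duplicates are dropped by name.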
