## Object-oriented programming

### Defining your own classes

In [7]:
class RepeatText:

    def __init__(self, n_repeats):
        self.n_repeats = n_repeats

    def multiply_text(self, some_text):
        print((some_text + " ") * self.n_repeats)

In [8]:
%load_ext line_profiler
repeat_twice = RepeatText(2)

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


In [9]:
%lprun -f repeat_twice.multiply_text repeat_twice.multiply_text("hello")

hello hello 


Timer unit: 1e-07 s

In [10]:
import numpy as np

Let’s look at another example. This time let’s use the UN Sustainable Development Goal data introduced in Lesson 1. In the example below, I’m creating a Goal5Data object to hold some data relevant to Goal 5, `Achieve gender equality and empower all women and girls.` This particular object will hold data for one of the targets associated with this goal, Target 5.5: `Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life.`

I want to be able to create an object to store the data for each country so that I can easily manipulate it in the same way. Here’s the code to create the new class and hold the data:

In [11]:
class Goal5Data:

    def __init__(self, name, population, women_in_parliament):
        self.name = name
        self.population = population
        self.women_in_parliament = women_in_parliament

    def print_summary(self):
        # Count missing values (None/NaN); a genuine 0.0 is valid data, not a null.
        values = np.asarray(self.women_in_parliament, dtype=float)
        null_women_in_parliament = int(np.isnan(values).sum())
        print(
            f"There are {len(self.women_in_parliament)} data points for Indicator 5.1.1, 'Proportion of seats held by women in national parliaments'."
        )
        print(f"{null_women_in_parliament} are nulls.")

In [12]:
usa = Goal5Data(
    name="USA",
    population=336262544,
    women_in_parliament=[
        13.33,
        14.02,
        14.02,
        14.25,
        14.25,
        14.94,
        15.17,
        16.32,
        16.78,
        17.01,
        16.78,
        16.78,
        16.82,
        17.78,
        18.29,
        19.35,
        19.35,
        19.08,
        19.49,
        23.56,
        23.43,
        27.25,
        27.71,
        29.43,
    ],
)

In [13]:
usa.population

336262544

In [14]:
usa.print_summary()

There are 24 data points for Indicator 5.1.1, 'Proportion of seats held by women in national parliaments'.
0 are nulls.


### OOP Principles

In [15]:
from scipy.stats import linregress


class Goal5TimeSeries(Goal5Data):
    def __init__(self, name, population, women_in_parliament, timestamps):
        super().__init__(name, population, women_in_parliament)
        self.timestamps = timestamps

    def fit_trendline(self):
        result = linregress(self.timestamps, self.women_in_parliament)
        slope = round(result.slope, 3)
        r_squared = round(result.rvalue**2, 3)
        return slope, r_squared

In [16]:
nepal = Goal5TimeSeries(
    name="Nepal",
    population=30552151,
    women_in_parliament=[
        9.02,
        9.01,
        8.84,
        8.84,
        8.84,
        8.29,
        8.26,
        8.26,
        9.06,
        9.06,
        10.83,
        10.83,
        11.01,
        11.01,
        11.38,
        11.97,
        11.97,
        11.81,
        11.81,
        12.6,
        14.36,
        14.44,
        14.94,
        15.13,
    ],
    timestamps=[
        2000,
        2001,
        2002,
        2003,
        2004,
        2005,
        2006,
        2007,
        2008,
        2009,
        2010,
        2011,
        2012,
        2013,
        2014,
        2015,
        2016,
        2017,
        2018,
        2019,
        2020,
        2021,
        2022,
        2023,
    ],
)

In [17]:
nepal.print_summary()

There are 24 data points for Indicator 5.1.1, 'Proportion of seats held by women in national parliaments'.
0 are nulls.


In [18]:
nepal.fit_trendline()

(0.292, 0.869)

### Data Science OOP Example

Let’s say you are part of a team of data scientists that have just been tasked with quickly building a tool that queries the SportsDB API for data on major sports leagues.

The tool will be used by analysts across different departments of the company with the following requirements:

- Initially the tool should return, for a specified sports league, the last 5 home games (date, time, opponent and scores) of each team, but this is expected to evolve over time
- The analyst should not need to learn how SportsDB works
- The analyst should not need to know how to make API calls in Python

Reference💡: SportsDB is a crowd-sourced database of sports artwork and metadata with a free public API

Since there is some time pressure to get this out, most likely your team will just write a few simple functions based on the requirements. For example, here is a function that gets all teams in a league:

In [19]:
import requests
import urllib.parse
def get_teams_in_league(league_name):
    base_url = "https://www.thesportsdb.com/api/v1/json/3/search_all_teams.php"
    query_string = "l=" + urllib.parse.quote(league_name.encode("utf8"))
    request_url = base_url + '?' + query_string
    response = requests.get(request_url)
    teams_json = response.json()["teams"]
    teams = [{"id": json["idTeam"], "name": json["strTeam"]} for json in teams_json]
    return teams

Similarly, we can write a function that gets the last 5 games of a team:

In [20]:
import requests
import urllib.parse
def get_last_5_games(team_id):
    base_url = "https://www.thesportsdb.com/api/v1/json/3/eventslast.php"
    query_string = "id=%s" % (team_id)
    request_url = base_url + '?' + query_string
    response = requests.get(request_url)
    # print(response.text)
    
    games_json = response.json()["results"]
    games = [
        {
            "home": json["strHomeTeam"],
            "home_score": json["intHomeScore"],
            "away": json["strAwayTeam"],
            "away_score": json["intAwayScore"],
            "at": json["strTimestamp"],
        }
        for json in games_json
    ]
    return games

Finally, here is a function that gets the last 5 games of every team in a league:

In [21]:
from itertools import chain
def get_last_5_games_for_league(league):
    teams = get_teams_in_league(league)
    # print (teams)
    last_5_by_team = [get_last_5_games(team["id"]) for team in teams]
    # print (last_5_by_team)
    last_5_by_team = list(chain.from_iterable(last_5_by_team))
    return last_5_by_team

Now you just have to put these functions into a .py file and make it accessible to any relevant environments (exactly how this is done depends on how the data platform is set up in your company), and everyone can start using them in their notebooks, data pipelines and models.

For example, we can get the last 5 games of every English Premier League team into a Pandas dataframe like this:

In [23]:
import pandas as pd
pd.DataFrame(get_last_5_games_for_league('English Premier League'))



Unnamed: 0,home,home_score,away,away_score,at
0,Arsenal,2,Everton,1,2024-05-19T15:00:00
1,Arsenal,3,Bournemouth,0,2024-05-04T11:30:00
2,Arsenal,5,Chelsea,0,2024-04-23T19:00:00
3,Arsenal,0,Aston Villa,2,2024-04-14T15:30:00
4,Arsenal,2,Bayern Munich,2,2024-04-09T19:00:00
...,...,...,...,...,...
95,Wolves,1,Crystal Palace,3,2024-05-11T14:00:00
96,Wolves,2,Luton,1,2024-04-27T14:00:00
97,Wolves,0,Bournemouth,1,2024-04-24T18:45:00
98,Wolves,0,Arsenal,2,2024-04-20T18:30:00


We can explore the API a bit more. You can get event- or game-specific information, as well as information on teams and their players.

In [49]:
import requests

def lookup_events(event_ids):
    for event_id in event_ids:
        api_call = requests.get(f"https://www.thesportsdb.com/api/v1/json/3/lookupevent.php?id={event_id}")
        storage = api_call.json()
        for event in storage["events"]:
            date_event = event["dateEvent"]
            home_team = event["strHomeTeam"]
            away_team = event["strAwayTeam"]

        print(f"{date_event}: {home_team} vs {away_team}")

event_ids = [2052711, 2052712, 2052713, 2052714]

lookup_events(event_ids)

2014-12-29: Liverpool vs Swansea
2014-12-29: Liverpool vs Swansea
2014-12-29: Liverpool vs Swansea
2014-12-29: Liverpool vs Swansea


In [50]:
api_call = requests.get("https://www.thesportsdb.com/api/v1/json/3/searchteams.php?t=Arsenal")
storage = api_call.json()
for team in storage["teams"]:
    print(team["strTeamBadge"])
    print(team["strStadium"])
    print(team["strLeague"])

https://www.thesportsdb.com/images/media/team/badge/uyhbfe1612467038.png
Emirates Stadium
English Premier League


In [51]:
api_call = requests.get("https://www.thesportsdb.com/api/v1/json/3/searchplayers.php?t=Arsenal")
storage = api_call.json()
# print (storage)
for player in storage["player"]:
    print(player["strPlayer"], player["strNationality"], player["strPosition"], player["strHeight"], player["strWeight"])

Reiss Nelson England Right Winger 5 ft 9 in (1.75 m) 71 kg
Gabriel Martinelli Brazil Left Wing 1.76 m (5 ft 9 in) 75 kg (165 lb)
Bukayo Saka England Right Winger 178 cm 65 kg
Oleksandr Zinchenko Ukraine Left-Back 1.75 m (5 ft 9 in) 61 kg
Takehiro Tomiyasu Japan Right-Back 188 cm 78 Kg
Thomas Partey Ghana Defensive Midfield 1.85 m (6 ft 1 in) 78 kg
Gabriel Jesus Brazil Centre-Forward 1.75 m (5 ft 9 in) 
David Raya Spain Goalkeeper 1.86 m (6 ft 1 in) 80 kg (176 lb)
Jurrien Timber The Netherlands Centre-Back 1.79 m (5 ft 10 in) 
Gabriel Magalhães Brazil Centre-Back 1.90 m (6 ft 2 in) 
Jorginho Italy Defensive Midfield 1.80 m (5 ft 11 in) 
Karl Hein Estonia Goalkeeper 1.93 m (6 ft 3 in) 
Emil Smith Rowe England Attacking Midfield 6 ft 0 in (1.83 m) 
Martin Ødegaard Norway Attacking Midfield 178cm / 5'10" 68kg / 150lbs
Mikel Arteta Spain Manager 177 cm 63 kg
William Saliba France Centre-Back 1.92 m (6 ft 4 in) 
Kai Havertz Germany Attacking Midfield 193cm / 6'4" 82kg / 181lbs
Leandro Trossa

### What’s wrong with writing functions?
Now you have given everyone in the company a way to interact with the data in SportsDB without having to know how to make API calls or know anything about the SportsDB API. That’s awesome! 🏆

So you may ask, what’s wrong with doing it this way? Well… imagine what you would do if one of these things happens one day:

<i>Problem 1: There is a company-wide push on self-service analytics and you need a way to allow analysts across the company to add more functions and expose more data from SportsDB for their own use.</i>

In the current implementation every function contains logic to make API calls, which makes it more difficult for analysts with less experience interacting with APIs to add additional functions.

One solution would be to add a function that is solely responsible for making API calls (and handling HTTP errors), but it is not obvious where that function should live:

- If you make it a very generic function, then it probably should not sit in the same file as the other functions here.
- On the other hand, if the function is very tied to the SportsDB API, then large projects will become littered with functions that make API calls to many different APIs, which feels messy.
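
A dedicated helper of this kind could look roughly like the sketch below. It only builds the request URL, so it can be checked without network access. The name `build_request_url` and its signature are illustrative assumptions, not part of SportsDB or of the functions above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.thesportsdb.com/api/v1/json/3/"


def build_request_url(method, params, base_url=BASE_URL):
    """Return the full request URL for an API method and its parameters.

    Hypothetical helper for illustration: urlencode() escapes the values
    and joins the pairs with '&', so callers never build strings by hand.
    """
    query_string = urlencode(params)
    if not query_string:
        return f"{base_url}{method}"
    return f"{base_url}{method}?{query_string}"


url = build_request_url("search_all_teams.php", {"l": "English Premier League"})
print(url)
```

Keeping URL construction in one place means a change to the base URL or to the escaping rules touches a single function rather than every caller.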

#### How does OOP help?
As we have seen above, we can deliver this tool very easily by writing a few functions, but it could quickly become difficult to maintain and extend as the requirements and the API evolve.

However, if you are familiar with OOP design patterns, this tool could just as easily have been delivered as a class along the lines below:

In [52]:
import requests
import urllib.parse
from itertools import chain

class SportsDB:
    base_url = "https://www.thesportsdb.com/api/v1/json/3/"

    # constructor
    def __init__(self):
        pass
    
    # given dictionary of parameters return url query string
    def __make_query_string(self, params):
        query_string = "?"
        for param_name, param_value in params.items():
            query_string += param_name + "=" + param_value + "&"
        return query_string
    
    # make API call - handle any http errors
    def call(self, method, params, payload_name):
        request_url = SportsDB.base_url + method + self.__make_query_string(params)
        response = requests.get(request_url)
        return response.json().get(payload_name, [])

    def get_teams_in_league(self, league_name):
        method = "search_all_teams.php"
        payload_name = "teams"
        params = {"l": urllib.parse.quote(league_name.encode("utf8"))}
        payload = self.call(method, params, payload_name)
        teams = [{"id": json["idTeam"], "name": json["strTeam"]} for json in payload]
        return teams

    def get_last_5_games(self, team_id):
        method = "eventslast.php"
        payload_name = "results"
        params = {"id": str(team_id)}
        payload = self.call(method, params, payload_name)
        games = [{
            "home": json["strHomeTeam"], 
            "home_score": json["intHomeScore"], 
            "away": json["strAwayTeam"], 
            "away_score": json["intAwayScore"],
            "at_local": json["strTimestamp"]
        } for json in payload]
        return games

    def get_last_5_games_for_league(self, league):
        teams = self.get_teams_in_league(league)
        last_5_by_team = [self.get_last_5_games(team["id"]) for team in teams]
        last_5_by_team = list(chain.from_iterable(last_5_by_team))
        return last_5_by_team

Just as before, depending on how your company / data science platform manages environments, simply make this package available for import, and use it anywhere like so:

In [53]:
import pandas as pd
sdb = SportsDB()
last_5_games = sdb.get_last_5_games_for_league("Indian Premier League")
pd.DataFrame(last_5_games)

Unnamed: 0,home,home_score,away,away_score,at_local
0,Chennai Super Kings,145.0,Rajasthan Royals,141.0,2024-05-12T10:00:00
1,Chennai Super Kings,162.0,Punjab Kings,163.0,2024-05-01T14:00:00
2,Chennai Super Kings,212.0,Sunrisers Hyderabad,134.0,2024-04-28T14:00:00
3,Chennai Super Kings,210.0,Lucknow Super Giants,213.0,2024-04-23T14:00:00
4,Chennai Super Kings,141.0,Kolkata Knight Riders,137.0,2024-04-08T14:00:00
5,Delhi Capitals,208.0,Lucknow Super Giants,189.0,2024-05-14T14:00:00
6,Delhi Capitals,221.0,Rajasthan Royals,201.0,2024-05-07T14:00:00
7,Delhi Capitals,257.0,Mumbai Indians,247.0,2024-04-27T10:00:00
8,Delhi Capitals,224.0,Gujarat Titans,220.0,2024-04-24T14:00:00
9,Delhi Capitals,199.0,Sunrisers Hyderabad,266.0,2024-04-20T14:00:00


In [33]:
import pandas as pd
sdb = SportsDB()
last_5_games = sdb.get_last_5_games_for_league("One Day International Series")
pd.DataFrame(last_5_games)

Unnamed: 0,home,home_score,away,away_score,at_local
0,Afghanistan Cricket,183,Uganda Cricket,58,2024-06-04T00:30:00
1,Afghanistan Cricket,139,New Zealand Cricket,288,2023-10-18T09:30:00
2,Afghanistan Cricket,272,India Cricket,273,2023-10-11T09:30:00
3,Afghanistan Cricket,,Australia Cricket,,2022-11-04T08:00
4,Afghanistan Cricket,,New Zealand Cricket,,2022-10-26T09:00
...,...,...,...,...,...
95,Zimbabwe Cricket,,Bangladesh Cricket,,2022-10-30T03:00
96,Zimbabwe Cricket,,Pakistan Cricket,,2022-10-27T12:00
97,Zimbabwe Cricket,,South Africa Cricket,,2022-10-24T09:00
98,Zimbabwe Cricket,174,Ireland Cricket,143,2022-10-17T09:00:00


As you can see, the code isn’t much longer than what we had originally, but now each method (functions in a class are called methods) performs a very specific task:

- `__make_query_string()` takes a dictionary of parameters and converts it into a URL query string
- `call()` makes the API call and returns the relevant payload
- `get_teams_in_league()`, `get_last_5_games()` and `get_last_5_games_for_league()` are essentially the interface methods that users of your class will use to get data from <b>SportsDB</b>

This solves <i>Problem 1</i> above: since the interface methods now contain only business logic, it is much easier for a data analyst to write additional ones as required.
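
A side note on the leading double underscore in `__make_query_string()`: it triggers Python's name mangling, which signals that the method is an internal helper rather than part of the interface. The short sketch below illustrates the effect with a hypothetical stand-in class, `Client`, which is not part of the tool itself.

```python
class Client:
    def __make_query_string(self, params):
        # Double leading underscore: Python stores this method on the class
        # as _Client__make_query_string (name mangling).
        return "?" + "&".join(f"{k}={v}" for k, v in params.items())

    def call(self, params):
        # Inside the class body the short name works as written.
        return self.__make_query_string(params)


client = Client()
inside = client.call({"id": "1"})
has_short_name = hasattr(client, "__make_query_string")
has_mangled_name = hasattr(client, "_Client__make_query_string")
print(inside, has_short_name, has_mangled_name)
```

Mangling does not make the method truly private, since the mangled name is still reachable, but it discourages callers from depending on it.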

Now that we've solved one problem, let's introduce another!

<i>Problem 2: SportsDB makes a big release in which the base URL changes, and it starts to limit the number of calls that can be made per minute by returning an HTTP 429 status code when the limit is breached.</i>

This should be a simple change, but in the original function-based implementation every function contains its own logic to make API calls, so someone would have the job of going through every single function and updating the base URL, as well as pasting in logic for error handling. That’s unpleasant and error-prone.

In the class, all API calls are made through the `call()` method and the base URL lives in a single class attribute, so we can solve <i>Problem 2</i> by updating `base_url` once and adjusting the `call()` method slightly, like so:

In [34]:
    # make API call - handle any http errors
    def call(self, method, params, payload_name):
        request_url = SportsDB.base_url + method + self.__make_query_string(params)
        response = requests.get(request_url)
        if response.status_code == 200:
            payload = response.json().get(payload_name, [])
        elif response.status_code == 429:
            # for this demo we simply catch it here - we could easily implement re-try after some specified time
            payload = "throttled by SportsDB"
        else:
            payload = "other error"
        return payload
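
The comment in the method above mentions retrying after a delay. A minimal sketch of that idea follows, assuming exponential backoff is acceptable. The function `call_with_retry`, its parameters, and the `FakeResponse` class are hypothetical and exist only so the example runs offline; in real use `fetch` could be something like `lambda: requests.get(request_url)`.

```python
import time


def call_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429 or attempts run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute.
    """
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return response  # still throttled after the final attempt


class FakeResponse:
    """Stand-in for a real HTTP response so the sketch runs offline."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two throttled replies followed by a success.
replies = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(lambda: next(replies), sleep=delays.append)
print(result.status_code, delays)
```

Passing `sleep` as a parameter keeps the retry logic testable without waiting for real time to pass.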

<i>Problem 3: Some reporting runs require the event date / time presented as a human-readable string (e.g. 22 Feb 2022 14:00) localized to any timezone.</i>

We can enhance the function `get_last_5_games_for_league()` to handle this, but as the number of functions grows, you will likely have the same logic repeated over and over again, making the code difficult to maintain.

We also have a simple way of solving <i>Problem 3</i> above by introducing a constructor with an instance attribute representing the required reporting timezone, and an additional method `to_display_string()` as below:

In [35]:
    # constructor
    def __init__(self, timezone="Europe/London"):
        self.display_timezone = pytz.timezone(timezone)

In [93]:
    # convert a SportsDB timestamp to localized display string
    def to_display_string(self, timestamp_str):
        # some timestamp strings lack the UTC offset due to data quality issues
        if len(timestamp_str)==25:
            dt_orig = dt.datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S%z")
        else:
            dt_orig = dt.datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S")
        # note: .timestamp() interprets a naive datetime in the machine's local timezone
        ts_orig = dt_orig.timestamp()
        dt_display = dt.datetime.fromtimestamp(ts_orig, self.display_timezone)
        return dt_display.strftime("%d %b %Y %H:%M")
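
In the method above, `.timestamp()` interprets a timestamp without an offset in the machine's local timezone, so the displayed time can differ from one machine to another. The sketch below is a variant, not the method used in this lesson: it assumes timestamps without an offset are in UTC and parses with the standard library's `datetime.fromisoformat`. It uses a fixed UTC+5 offset as a stand-in for a named zone to stay dependency-free; a `pytz` or `zoneinfo` timezone object could be passed instead.

```python
import datetime as dt


def to_display_string(timestamp_str, display_tz):
    """Format an ISO timestamp in display_tz, treating naive values as UTC."""
    parsed = dt.datetime.fromisoformat(timestamp_str)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=dt.timezone.utc)
    return parsed.astimezone(display_tz).strftime("%d %b %Y %H:%M")


# Fixed UTC+5 offset, a stand-in for a named reporting timezone.
utc_plus_5 = dt.timezone(dt.timedelta(hours=5))
naive = to_display_string("2024-05-12T10:00:00", utc_plus_5)
aware = to_display_string("2024-05-12T10:00:00+00:00", utc_plus_5)
no_seconds = to_display_string("2022-11-04T08:00", utc_plus_5)
print(naive, aware, no_seconds)
```

Because `fromisoformat` also accepts timestamps without seconds, this variant copes with the shorter values seen in some of the earlier output.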

Lastly, we will use the method `to_display_string()` in `get_last_5_games()` to add an event datetime in the reporting timezone, so that the final class looks like this:

In [54]:
import requests
import urllib.parse
from itertools import chain
import datetime as dt
import pytz

class SportsDB:
    base_url = "https://www.thesportsdb.com/api/v1/json/3/"

    # constructor
    def __init__(self, timezone="Europe/London"):
        self.display_timezone = pytz.timezone(timezone)
    
    # given dictionary of parameters return url query string
    def __make_query_string(self, params):
        query_string = "?"
        for param_name, param_value in params.items():
            query_string += param_name + "=" + param_value + "&"
        return query_string
    
    # convert a SportsDB timestamp to localized display string
    def to_display_string(self, timestamp_str):
        # some timestamp strings lack the UTC offset due to data quality issues
        if len(timestamp_str)==25:
            dt_orig = dt.datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S%z")
        else:
            dt_orig = dt.datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%S")
        # note: .timestamp() interprets a naive datetime in the machine's local timezone
        ts_orig = dt_orig.timestamp()
        dt_display = dt.datetime.fromtimestamp(ts_orig, self.display_timezone)
        return dt_display.strftime("%d %b %Y %H:%M")


    # make API call - handle any http errors
    def call(self, method, params, payload_name):
        request_url = SportsDB.base_url + method + self.__make_query_string(params)
        response = requests.get(request_url)
        if response.status_code == 200:
            payload = response.json().get(payload_name, [])
        elif response.status_code == 429:
            # for this demo we simply catch it here - we could easily implement re-try after some specified time
            payload = "throttled by SportsDB"
        else:
            payload = "other error"
        return payload

    def get_teams_in_league(self, league_name):
        method = "search_all_teams.php"
        payload_name = "teams"
        params = {"l": urllib.parse.quote(league_name.encode("utf8"))}
        payload = self.call(method, params, payload_name)
        teams = [{"id": json["idTeam"], "name": json["strTeam"]} for json in payload]
        return teams

    def get_last_5_games(self, team_id):
        method = "eventslast.php"
        payload_name = "results"
        params = {"id": str(team_id)}
        payload = self.call(method, params, payload_name)
        games = [{
            "home": json["strHomeTeam"], 
            "home_score": json["intHomeScore"], 
            "away": json["strAwayTeam"], 
            "away_score": json["intAwayScore"],
            "at_local": json["strTimestamp"],
            "at_display": self.to_display_string(json["strTimestamp"])
        } for json in payload]
        return games

    def get_last_5_games_for_league(self, league):
        teams = self.get_teams_in_league(league)
        last_5_by_team = [self.get_last_5_games(team["id"]) for team in teams]
        last_5_by_team = list(chain.from_iterable(last_5_by_team))
        return last_5_by_team

We can now initialise this class with a desired reporting timezone, and use it like so:

In [55]:
import pandas as pd
sdb = SportsDB(timezone="Indian/Maldives")
last_5_games = sdb.get_last_5_games_for_league("Indian Premier League")
pd.DataFrame(last_5_games)

Unnamed: 0,home,home_score,away,away_score,at_local,at_display
0,Chennai Super Kings,145.0,Rajasthan Royals,141.0,2024-05-12T10:00:00,12 May 2024 09:30
1,Chennai Super Kings,162.0,Punjab Kings,163.0,2024-05-01T14:00:00,01 May 2024 13:30
2,Chennai Super Kings,212.0,Sunrisers Hyderabad,134.0,2024-04-28T14:00:00,28 Apr 2024 13:30
3,Chennai Super Kings,210.0,Lucknow Super Giants,213.0,2024-04-23T14:00:00,23 Apr 2024 13:30
4,Chennai Super Kings,141.0,Kolkata Knight Riders,137.0,2024-04-08T14:00:00,08 Apr 2024 13:30
5,Delhi Capitals,208.0,Lucknow Super Giants,189.0,2024-05-14T14:00:00,14 May 2024 13:30
6,Delhi Capitals,221.0,Rajasthan Royals,201.0,2024-05-07T14:00:00,07 May 2024 13:30
7,Delhi Capitals,257.0,Mumbai Indians,247.0,2024-04-27T10:00:00,27 Apr 2024 09:30
8,Delhi Capitals,224.0,Gujarat Titans,220.0,2024-04-24T14:00:00,24 Apr 2024 13:30
9,Delhi Capitals,199.0,Sunrisers Hyderabad,266.0,2024-04-20T14:00:00,20 Apr 2024 13:30


### MNIST Dataset Example

In [80]:
# Original code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

# Load MNIST dataset (handwritten digits; Yann LeCun's website: http://yann.lecun.com/exdb/mnist/)
mnist = fetch_openml('mnist_784')

# Preprocess data
X = mnist.data
y = mnist.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, random_state=42)
rfc.fit(X_train_scaled, y_train)

# Train a neural network classifier
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, random_state=42)
mlp.fit(X_train_scaled, y_train)

# Make predictions and evaluate the models
y_pred_rfc = rfc.predict(X_test_scaled)
y_pred_mlp = mlp.predict(X_test_scaled)

accuracy_rfc = accuracy_score(y_test, y_pred_rfc)
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)

print(f'RFC Accuracy: {accuracy_rfc:.3f}')
print(f'MLP Accuracy: {accuracy_mlp:.3f}')

print(classification_report(y_test, y_pred_rfc))
print(classification_report(y_test, y_pred_mlp))




RFC Accuracy: 0.964
MLP Accuracy: 0.965
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      2778
           1       0.98      0.99      0.98      3159
           2       0.95      0.97      0.96      2806
           3       0.95      0.94      0.95      2829
           4       0.96      0.97      0.97      2648
           5       0.96      0.96      0.96      2544
           6       0.97      0.99      0.98      2766
           7       0.97      0.96      0.96      2985
           8       0.96      0.94      0.95      2665
           9       0.95      0.95      0.95      2820

    accuracy                           0.96     28000
   macro avg       0.96      0.96      0.96     28000
weighted avg       0.96      0.96      0.96     28000

              precision    recall  f1-score   support

           0       0.98      0.97      0.98      2778
           1       0.98      0.99      0.98      3159
           2       0.96      0.96     

### Refactor Using OOP Principles

In [56]:
class DataLoader:
    def __init__(self, dataset_name):
        self.dataset_name = dataset_name

    def load_data(self):
        if self.dataset_name =='mnist_784':
            return fetch_openml(self.dataset_name)
        elif self.dataset_name == 'other_dataset':
            # Add support for other datasets here
            pass
        else:
            raise ValueError(f"Unsupported dataset: {self.dataset_name}")

class DataPreprocessor:
    def __init__(self):
        pass

    def preprocess(self, data):
        X = data.data
        y = data.target
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
        return X_train, X_test, y_train, y_test

class FeatureScaler:
    def __init__(self):
        pass

    def scale_features(self, X_train, X_test):
        scaler = StandardScaler()
        self.X_train = scaler.fit_transform(X_train)
        self.X_test = scaler.transform(X_test)
        return self.X_train, self.X_test

class ModelTrainer:
    def __init__(self):
        pass

    def train_model(self, model_type, X_train, y_train):
        if model_type == 'rfc':
            model = RandomForestClassifier(n_estimators=50, random_state=42)
        elif model_type =='mlp':
            model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, random_state=42)
        else:
            raise ValueError(f"Unsupported model type: {model_type}")
        model.fit(X_train, y_train)
        return model

class TestRunner:
    def __init__(self, data_loader, data_preprocessor, feature_scaler, model_trainer, model_evaluator):
        self.data_loader = data_loader
        self.data_preprocessor = data_preprocessor
        self.feature_scaler = feature_scaler
        self.model_trainer = model_trainer
        self.model_evaluator = model_evaluator

    def run_test(self, model_type):
        data = self.data_loader.load_data()
        X_train, X_test, y_train, y_test = self.data_preprocessor.preprocess(data)
        X_train_scaled, X_test_scaled = self.feature_scaler.scale_features(X_train, X_test)
        model = self.model_trainer.train_model(model_type, X_train_scaled, y_train)
        self.model_evaluator.evaluate_model(model, X_test_scaled, y_test)
        
class ModelEvaluator:
    def __init__(self):
        pass

    def evaluate_model(self, model, X_test, y_test):
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f'Accuracy: {accuracy:.3f}')
        print(classification_report(y_test, y_pred))

# Example usage
dataset_name ='mnist_784'
model_type = 'rfc'

data_loader = DataLoader(dataset_name)
# The helper classes below take no constructor arguments
data_preprocessor = DataPreprocessor()
feature_scaler = FeatureScaler()
model_trainer = ModelTrainer()
model_evaluator = ModelEvaluator()

test_runner = TestRunner(data_loader, data_preprocessor, feature_scaler, model_trainer, model_evaluator)
test_runner.run_test(model_type)

## Functional programming

In [24]:
usa_govt_percentages = [
    13.33,
    14.02,
    14.02,
    14.25,
    14.25,
    14.94,
    15.17,
    16.32,
    16.78,
    17.01,
    16.78,
    16.78,
    16.82,
    17.78,
    18.29,
    19.35,
    19.35,
    19.08,
    19.49,
    23.56,
    23.43,
    27.25,
    27.71,
    29.43,
]

In [26]:
usa_govt_proportions = list(map(lambda x: x / 100, usa_govt_percentages))

In [27]:
usa_govt_proportions

[0.1333,
 0.1402,
 0.1402,
 0.1425,
 0.1425,
 0.1494,
 0.1517,
 0.1632,
 0.1678,
 0.17010000000000003,
 0.1678,
 0.1678,
 0.16820000000000002,
 0.1778,
 0.18289999999999998,
 0.1935,
 0.1935,
 0.19079999999999997,
 0.1949,
 0.23559999999999998,
 0.2343,
 0.2725,
 0.2771,
 0.2943]

In [4]:
integer_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

squared_integers = list(map(lambda x: x**2, integer_list))

squared_integers

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [4]:
from functools import reduce
nested_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Concatenate into a new list (starting from []) so the input lists are not mutated
flattened_list = reduce(lambda x, y: x + y, nested_lists, [])
print(flattened_list)

[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [5]:
people = [
    {"name": "John", "age": 25},
    {"name": "Jane", "age": 30},
    {"name": "Tom", "age": 25},
]
def group_by_age(acc, person):
    age = person["age"]
    acc.setdefault(age, []).append(person)
    return acc
grouped_by_age = reduce(group_by_age, people, {})
print(grouped_by_age)

{25: [{'name': 'John', 'age': 25}, {'name': 'Tom', 'age': 25}], 30: [{'name': 'Jane', 'age': 30}]}


### Decorators in Python

Imagine you have a web application that has multiple functions that perform different tasks, such as authenticating users, retrieving data from a database, and sending emails. You want to log each function call, including the input arguments and the execution time, to help with debugging and performance monitoring.

Here's an example of how you can use a decorator to achieve this:

In [2]:
import time
import functools

def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"{func.__name__} called with args {args} and kwargs {kwargs} in {end_time - start_time:.2f} seconds")
        return result
    return wrapper

@log_calls
def authenticate_user(username, password):
    # Simulate authentication logic
    time.sleep(1)
    return True

@log_calls
def retrieve_data_from_db(query):
    # Simulate database query
    time.sleep(2)
    return ["result1", "result2"]

@log_calls
def send_email(to, subject, body):
    # Simulate email sending
    time.sleep(3)
    return True

# Call the decorated functions
authenticate_user("john", "password")
retrieve_data_from_db("SELECT * FROM users")
send_email("john@example.com", "Hello", "This is a test email")

authenticate_user called with args ('john', 'password') and kwargs {} in 1.00 seconds
retrieve_data_from_db called with args ('SELECT * FROM users',) and kwargs {} in 2.01 seconds
send_email called with args ('john@example.com', 'Hello', 'This is a test email') and kwargs {} in 3.01 seconds


True

In this example, the <b>log_calls</b> decorator is defined to log each function call, including the input arguments and the execution time. The decorator uses the <b>functools.wraps</b> function to preserve the original function's metadata, such as its name and docstring.

The <b>authenticate_user</b>, <b>retrieve_data_from_db</b>, and <b>send_email</b> functions are decorated with the <b>log_calls</b> decorator, which means that each time these functions are called, the decorator will log the function call, including the input arguments and the execution time.
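
To see what `functools.wraps` preserves, the following sketch compares a decorator written without it to one written with it. The decorator and function names here are illustrative only.

```python
import functools


def plain_decorator(func):
    # No functools.wraps: the returned function keeps the wrapper's metadata.
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapping_decorator(func):
    # functools.wraps copies __name__, __doc__ and other metadata from func.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def greet_plain(name):
    """Return a greeting."""
    return f"Hello, {name}"


@wrapping_decorator
def greet_wrapped(name):
    """Return a greeting."""
    return f"Hello, {name}"


print(greet_plain.__name__, greet_plain.__doc__)      # wrapper None
print(greet_wrapped.__name__, greet_wrapped.__doc__)  # greet_wrapped Return a greeting.
```

Without `functools.wraps`, a logging decorator that prints `func.__name__` on an already-decorated function would report `wrapper`, which makes stacked decorators harder to debug.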

### MNIST Using Functional Programming

In [7]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Define a function to load the data
def load_data(dataset_name):
    if dataset_name =='mnist_784':
        return fetch_openml(dataset_name)
    elif dataset_name == 'other_dataset':
        # Add support for other datasets here
        pass
    else:
        raise ValueError(f"Unsupported dataset: {dataset_name}")

# Define a function to preprocess the data
def preprocess_data(data):
    X = data.data
    y = data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

# Define a function to scale the features
def scale_features(X_train, X_test):
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test

# Define a function to train a model
def train_model(X_train, y_train, model_type):
    if model_type == 'rfc':
        model = RandomForestClassifier(n_estimators=50, random_state=42)
    elif model_type =='mlp':
        model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=50, random_state=42)
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    model.fit(X_train, y_train)
    return model

# Define a function to evaluate a model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    print(classification_report(y_test, y_pred))

# Define a function to visualize the data
def visualize_data(X_train, y_train, X_test, y_test):
    # Add visualization code here
    pass

# Define a function to run the test
def run_test(dataset_name, model_type):
    data = load_data(dataset_name)
    X_train, X_test, y_train, y_test = preprocess_data(data)
    X_train_scaled, X_test_scaled = scale_features(X_train, X_test)
    model = train_model(X_train_scaled, y_train, model_type)
    evaluate_model(model, X_test_scaled, y_test)
    visualize_data(X_train, y_train, X_test, y_test)

# Example usage
dataset_name ='mnist_784'
model_type = 'rfc'

run_test(dataset_name, model_type)

Accuracy: 0.964
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1343
           1       0.98      0.98      0.98      1600
           2       0.95      0.97      0.96      1380
           3       0.95      0.95      0.95      1433
           4       0.95      0.97      0.96      1295
           5       0.97      0.96      0.96      1273
           6       0.97      0.98      0.98      1396
           7       0.97      0.97      0.97      1503
           8       0.96      0.94      0.95      1357
           9       0.95      0.94      0.95      1420

    accuracy                           0.96     14000
   macro avg       0.96      0.96      0.96     14000
weighted avg       0.96      0.96      0.96     14000

