<a href="https://colab.research.google.com/github/Ibraheem101/mlops/blob/main/mlops/mlops.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MLOps project**

In [15]:
import os
import re
import json
import math
import nltk
import torch
import gensim
import random
import urllib
import warnings
import itertools
import collections
import numpy as np
import pandas as pd
import seaborn as sns
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

In [16]:
sns.set_theme()
warnings.filterwarnings("ignore")

## **Systems Design**

![systems design](https://madewithml.com/static/images/mlops/systems-design/workloads.png)

Source: [Made With ML](https://madewithml.com/courses/mlops/systems-design/)

## **Data**


### **Preparation**

#### **Data ingestion**
Data ingestion is the step of the MLOps (Machine Learning Operations) cycle that collects data from its sources and loads it into the machine learning pipeline for analysis, processing, and modeling. It is a critical step because the quality and reliability of the ingested data directly affect the performance and accuracy of the resulting models.
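Because data quality matters this early, it can help to check a freshly ingested frame before using it. The sketch below is one possible way to do that, not part of the original notebook: `validate_ingested` and `REQUIRED_COLUMNS` are hypothetical names, the column list is taken from this dataset's schema, and the two-row CSV is a made-up sample so the block runs without network access.

```python
import io

import pandas as pd

# Assumption: the columns this project expects after ingestion.
REQUIRED_COLUMNS = ["id", "created_on", "title", "description", "tag"]


def validate_ingested(df, required_columns=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in an ingested frame."""
    problems = []

    # 1. Every expected column must be present.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # 2. Report null values in the expected columns that do exist.
    present = [c for c in required_columns if c in df.columns]
    null_counts = df[present].isna().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f"{col}: {n} null value(s)")

    # 3. Identifiers should be unique.
    if "id" in df.columns and df["id"].duplicated().any():
        problems.append("duplicate ids")

    return problems


# Hypothetical sample with one empty description, standing in for the real CSV.
sample_csv = io.StringIO(
    "id,created_on,title,description,tag\n"
    "6,2020-02-20 06:43:18,Title A,Desc A,computer-vision\n"
    "7,2020-02-20 06:47:21,Title B,,computer-vision\n"
)
sample_df = pd.read_csv(sample_csv)
problems = validate_ingested(sample_df)
print(problems)
```

An empty list means the frame passed these three checks; stricter checks (dtypes, allowed tag values, date parsing) could be added in the same style.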

In [4]:
# Data ingestion
DATASET_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/dataset.csv"
df = pd.read_csv(DATASET_LOC)
df.head()

| | id | created_on | title | description | tag |
|---|---|---|---|---|---|
| 0 | 6 | 2020-02-20 06:43:18 | Comparison between YOLO and RCNN on real world... | Bringing theory to experiment is cool. We can ... | computer-vision |
| 1 | 7 | 2020-02-20 06:47:21 | Show, Infer & Tell: Contextual Inference for C... | The beauty of the work lies in the way it arch... | computer-vision |
| 2 | 9 | 2020-02-24 16:24:45 | Awesome Graph Classification | A collection of important graph embedding, cla... | graph-learning |
| 3 | 15 | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search | A curated list of Monte Carlo tree search pape... | reinforcement-learning |
| 4 | 25 | 2020-03-07 23:04:31 | AttentionWalk | A PyTorch Implementation of "Watch Your Step: ... | graph-learning |


#### **Data splitting**

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
df.tag.value_counts()

natural-language-processing    310
computer-vision                285
mlops                           63
reinforcement-learning          45
graph-learning                  33
time-series                     28
Name: tag, dtype: int64

Which criteria should we focus on to ensure proper data splits?
* Randomness
* Stratification (for imbalanced datasets)
* Sufficient data size in each subset
* Temporal data handling (for time-series data)
* Consistency across models
* Reproducibility with a fixed random seed
* Adequate validation set size for hyperparameter tuning
* Complete separation of the test set from training and validation data
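Several of these criteria (stratification, reproducibility, and a separate test set) can be combined in a single helper. The following is a minimal sketch under stated assumptions, not the notebook's own method: `stratified_three_way_split` is a hypothetical function, the 15% validation and test fractions are arbitrary, and `toy_df` is synthetic data used only to make the block runnable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, label_col, val_size=0.15, test_size=0.15, seed=1234):
    """Split df into train/val/test frames, stratified on label_col."""
    # First carve off the test set so it never touches training or validation.
    train_val_df, test_df = train_test_split(
        df, stratify=df[label_col], test_size=test_size, random_state=seed
    )
    # val_size is a fraction of the full data, so rescale it to the remainder.
    relative_val = val_size / (1.0 - test_size)
    train_df, val_df = train_test_split(
        train_val_df,
        stratify=train_val_df[label_col],
        test_size=relative_val,
        random_state=seed,
    )
    return train_df, val_df, test_df


# Synthetic, imbalanced labels (assumption: illustration only).
toy_df = pd.DataFrame({"id": range(200), "tag": ["a"] * 120 + ["b"] * 60 + ["c"] * 20})
toy_train, toy_val, toy_test = stratified_three_way_split(toy_df, "tag")

# The three subsets are disjoint and together cover the whole frame.
print(len(toy_train), len(toy_val), len(toy_test))
```

Fixing `seed` makes the split repeatable across runs, and stratifying both calls keeps the rare class `c` represented in every subset.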

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 764 entries, 0 to 763
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           764 non-null    int64 
 1   created_on   764 non-null    object
 2   title        764 non-null    object
 3   description  764 non-null    object
 4   tag          764 non-null    object
dtypes: int64(1), object(4)
memory usage: 30.0+ KB


In [9]:
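# Fraction of the data held out for validation
test_size = 0.2
# Stratify on the tag so both splits keep the class proportions; fix the seed for reproducibility
train_df, val_df = train_test_split(df, stratify=df.tag, test_size=test_size, random_state=1234)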

In [10]:
train_df.tag.value_counts()

natural-language-processing    248
computer-vision                228
mlops                           50
reinforcement-learning          36
graph-learning                  26
time-series                     23
Name: tag, dtype: int64

In [11]:
val_df.tag.value_counts()

natural-language-processing    62
computer-vision                57
mlops                          13
reinforcement-learning          9
graph-learning                  7
time-series                     5
Name: tag, dtype: int64

In [12]:
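# Validation (adjusted) value counts: scale by the train/validation ratio (0.8 / 0.2 = 4)
# so they can be compared directly with the training counts.
# round() avoids truncating a ratio that floating-point error leaves just under an integer.
val_df.tag.value_counts() * round((1 - test_size) / test_size)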

natural-language-processing    248
computer-vision                228
mlops                           52
reinforcement-learning          36
graph-learning                  28
time-series                     20
Name: tag, dtype: int64
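Scaled counts only line up neatly when the train/validation ratio is a whole number. An alternative check, sketched here as an assumption rather than the notebook's approach, compares per-class proportions directly. `proportion_gap` is a hypothetical helper, and the label series are rebuilt from the value counts printed above so the block runs on its own.

```python
import pandas as pd


def proportion_gap(train_labels, val_labels):
    """Absolute difference in class proportions between two label series."""
    train_p = train_labels.value_counts(normalize=True)
    val_p = val_labels.value_counts(normalize=True)
    # Align validation proportions to the training classes (0.0 if a class is absent).
    val_p = val_p.reindex(train_p.index, fill_value=0.0)
    return (train_p - val_p).abs().sort_values(ascending=False)


# Class counts copied from the train and validation outputs above.
train_counts = {
    "natural-language-processing": 248, "computer-vision": 228, "mlops": 50,
    "reinforcement-learning": 36, "graph-learning": 26, "time-series": 23,
}
val_counts = {
    "natural-language-processing": 62, "computer-vision": 57, "mlops": 13,
    "reinforcement-learning": 9, "graph-learning": 7, "time-series": 5,
}
train_labels = pd.Series([tag for tag, n in train_counts.items() for _ in range(n)])
val_labels = pd.Series([tag for tag, n in val_counts.items() for _ in range(n)])

gaps = proportion_gap(train_labels, val_labels)
print(gaps)
```

With these counts every class differs by less than one percentage point between the two splits, which is what a stratified split should produce.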

### **Exploratory Data Analysis**

In [17]:
from collections import Counter
from wordcloud import WordCloud, STOPWORDS