Atalov S.

Fundamentals of Machine Learning and Artificial Intelligence

# Semester Project: Practical Application of Machine Learning Techniques
---



### Objective:
The objective of this semester project is to apply the machine learning techniques you've learned in this course to a real-world problem. You will select a dataset, define a problem statement, apply appropriate machine learning models, and critically analyze your results.

### Project Requirements:

1. **Selection of a Dataset and Problem Statement**:
   - Choose a dataset that is publicly available and has not been extensively analyzed in popular machine learning literature or online competitions.
   - Define a clear and concise problem statement. This could involve prediction, classification, anomaly detection, or any other suitable task based on the dataset.

2. **Data Preprocessing**:
   - Conduct necessary preprocessing steps such as handling missing data, feature scaling, and encoding categorical variables.
   - Provide a thorough exploratory data analysis (EDA) with visualizations to understand the features and relationships within the data.

3. **Model Implementation**:
   - Implement at least three different machine learning models. At least one of them should use an advanced technique (e.g., an ensemble method).
   - Justify the choice of each model and discuss the assumptions relevant to each.

4. **Model Evaluation and Selection**:
   - Split the data into training, validation, and testing sets.
   - Evaluate the performance of the models using appropriate metrics (accuracy, precision, recall, F1-score, ROC curve, etc.).
   - Fine-tune the models using techniques like cross-validation or hyperparameter optimization.
   - Select the best-performing model based on the evaluation metrics and provide a rationale for your selection.

5. **Results and Discussion**:
   - Present the results of the best model in a clear and understandable manner.
   - Discuss the implications of your findings in the context of the problem statement.
   - Identify any potential biases or limitations in your model and dataset.

6. **Documentation and Code**:
   - Document all aspects of your project in a clear, structured report.
   - Include code in a Jupyter Notebook or similar format, with comments explaining the code and decisions made at each step.
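
As a rough illustration of requirements 2 to 4, the sketch below compares three models with cross-validation and evaluates only the selected one on held-out data. Everything specific in it is an assumption made for the example: the data comes from `make_classification`, and the three models, the F1 metric, and the 80/20 split are placeholder choices rather than prescribed ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption for this sketch)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Three candidate models; scaling lives inside each pipeline so it is
# re-fit on the training folds only during cross-validation
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=5, random_state=0)),
    "random_forest": make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=0)),
}

# Model selection uses cross-validation on the training data only
cv_f1 = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(cv_f1, key=cv_f1.get)

# The test set is touched once, for the selected model
best_model = candidates[best_name].fit(X_train, y_train)
test_f1 = f1_score(y_test, best_model.predict(X_test))

for name, score in cv_f1.items():
    print(f"{name}: mean CV F1 = {score:.3f}")
print(f"selected {best_name}, test F1 = {test_f1:.3f}")
```

Keeping the scaler inside each pipeline is a design choice that prevents statistics from validation folds or the test set from leaking into training.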

### Deliverables:
1. **Report**: A detailed report or presentation (at least 5 pages or 10 slides) with Introduction, Methods, Results, Discussion, and Conclusions sections.
2. **Code**: A Jupyter notebook or Python scripts containing all code used in the project, adequately commented.

### Evaluation Criteria:
- **Originality and Difficulty of the Problem Statement** (2 points)
- **Quality and Depth of the Data Analysis** (2 points)
- **Innovativeness and Appropriateness of the Model Approach** (2 points)
- **Accuracy and Complexity of the Model Evaluation** (2 points)
- **Quality of the Final Report and Code** (2 points)

### Submission:
Submit the final report (presentation) and code via the ecourse one week before the Final Exam.

---


### Additional Options for the Semester Project:

In addition to the practical application project described above, students have the option to undertake one of the following alternatives based on their interests and career goals:

1. **Development of an Innovative Machine Learning Application**:
   - Propose and develop an innovative application of machine learning that solves a unique problem or enhances existing solutions.
   - This could involve the development of a software tool, a web application, or a mobile app that utilizes machine learning techniques.
   - Document the design, development process, challenges faced, and the impact or potential impact of the application.

2. **Theoretical Research Paper**:
   - Conduct a detailed study on a specific machine learning algorithm or a comparative analysis of several algorithms.
   - Your paper should cover the theoretical foundations of the algorithm, variations and improvements over the basic algorithm, advantages and limitations, and potential applications.
   - Review recent research articles and include a section on future directions for research based on current challenges and advancements.
   - You may write the paper in Russian (the requirements remain the same).

### Guidelines for Alternative Projects:

- **Proposal Submission**: Before starting, submit a detailed proposal outlining the chosen project or research topic, objectives, methodology, expected outcomes, and a timeline. This proposal must be approved by the course instructor.
  
- **Final Presentation**: At the end of the semester, present your project or research findings to the class. This presentation should include a comprehensive overview of your work, key findings or demonstrations, and a Q&A session.
  
- **Documentation**: 
  - For application development, provide documentation that includes user guides, technical architecture, and code comments.
  - For research papers, follow academic writing standards and cite all sources properly in your bibliography.
  
### Evaluation Criteria for Alternative Projects:
- **Innovation and Originality** (2 points)
- **Depth of Technical Content** (2 points)
- **Practical or Theoretical Value** (2 points)
- **Quality and Clarity of Presentation** (2 points)
- **Comprehensiveness of Documentation or Paper** (2 points)

These alternatives are designed to cater to students who wish to delve deeper into the technical aspects or real-world applications of machine learning, providing them with a platform to showcase their specialized skills or research capabilities.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import ConvergenceWarning
from joblib import Parallel, delayed
import warnings

In [2]:
class GeneticAlgorithmStackingEnsemble:
    def __init__(self, X, y, param_grid, population_size=100, generations=5, n_jobs=-1):
        self.param_grid = param_grid
        self.X = X
        self.y = y
        self.population_size = population_size
        self.generations = generations
        self.scaler = StandardScaler()
        self.X = self.scaler.fit_transform(self.X)
        self.n_jobs = n_jobs
        self.final_population = self.genetic_algorithm(self.X, self.y, self.param_grid)
        self.base_models = self.create_base_models(self.final_population)
        self.stacking_clf = StackingClassifier(estimators=self.base_models, final_estimator=LogisticRegression(max_iter=1000, solver='saga'))

    @staticmethod
    def create_population(param_grid, population_size):
        population = []
        for _ in range(population_size):
            individual = {}
            for model, params in param_grid.items():
                individual[model] = {key: np.random.choice(values) for key, values in params.items()}
            population.append(individual)
        return population

    @staticmethod
    def evaluate_individual(individual, X, y, n_jobs):
        scores = []
        for model_name, params in individual.items():
            if model_name == 'RandomForest':
                model = RandomForestClassifier(**params)
            elif model_name == 'SVM':
                model = SVC(**params)
            elif model_name == 'DecisionTree':
                model = DecisionTreeClassifier(**params)
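            elif model_name == 'LogisticRegression':
                model = LogisticRegression(**params, max_iter=1000, solver='saga')
            else:
                # Fail loudly instead of silently reusing the previous model
                raise ValueError(f"Unknown model name: {model_name}")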
            score = np.mean(cross_val_score(model, X, y, cv=5, n_jobs=n_jobs))
            scores.append(score)
            print(f"{model_name}:{score}")
        return np.mean(scores)

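    def genetic_algorithm(self, X, y, param_grid):
        population = self.create_population(param_grid, self.population_size)
        # Sizes derived from the population size; for the default of 100 this gives
        # 50 survivors, 12 parent pairs (up to 24 children), and 25 mutants
        n_survivors = max(2, self.population_size // 2)
        n_pairs = self.population_size // 8
        n_mutants = self.population_size // 4
        for generation in range(self.generations):
            # Individuals are evaluated in parallel, so the inner cross-validation runs single-threaded
            scores = Parallel(n_jobs=self.n_jobs)(delayed(self.evaluate_individual)(individual, X, y, 1) for individual in population)
            # Keep the highest-scoring individuals, best first
            best_individuals = [population[i] for i in np.argsort(scores)[::-1][:n_survivors]]
            new_population = []

            # Add crossover offspring (two children per pair of distinct parents)
            for _ in range(n_pairs):
                i, j = np.random.choice(len(best_individuals), 2, replace=False)
                parent1, parent2 = best_individuals[i], best_individuals[j]
                if self.check_compatible_models(parent1, parent2):
                    child1, child2 = self.crossover(parent1, parent2)
                    new_population.extend([child1, child2])

            # Add mutated copies; copying first keeps the surviving parent unchanged
            for i in range(n_mutants):
                parent = best_individuals[i % len(best_individuals)]
                parent_copy = {model: dict(params) for model, params in parent.items()}
                new_population.append(self.mutate(parent_copy, param_grid))

            # Next generation: survivors first, then offspring. Extending the old
            # population instead would let the truncation below discard every new individual.
            population = best_individuals + new_population

            # Replace duplicates with fresh random individuals
            population = self.ensure_unique(population, param_grid)

            # Top up with random individuals if needed, then cap the population size
            while len(population) < self.population_size:
                population.append(self.create_individual(param_grid))
            population = population[:self.population_size]
            print(f"Generation {generation + 1}: best CV score {max(scores):.4f}")
        return population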

    @staticmethod
    def check_compatible_models(model1, model2):
        return model1.keys() == model2.keys()

    @staticmethod
    def crossover(parent1, parent2):
        child1, child2 = {}, {}
        for model in parent1.keys():
            child1[model], child2[model] = {}, {}
            for param in parent1[model].keys():
                if np.random.rand() > 0.5:
                    child1[model][param] = parent1[model][param]
                    child2[model][param] = parent2[model][param]
                else:
                    child1[model][param] = parent2[model][param]
                    child2[model][param] = parent1[model][param]
        return child1, child2

    def mutate(self, individual, param_grid):
        model = np.random.choice(list(individual.keys()))
        param = np.random.choice(list(individual[model].keys()))
        if param in param_grid[model]:
            individual[model][param] = np.random.choice(param_grid[model][param])
        else:
            # If param is not in param_grid, randomly select another param to mutate
            other_params = list(set(param_grid[model].keys()) - {param})
            if other_params:
                new_param = np.random.choice(other_params)
                individual[model][new_param] = np.random.choice(param_grid[model][new_param])
        return individual

    @staticmethod
    def ensure_unique(population, param_grid):
        seen = set()
        unique_population = []
        for individual in population:
            serialized = tuple((model, tuple(params.items())) for model, params in individual.items())
            if serialized not in seen:
                seen.add(serialized)
                unique_population.append(individual)
            else:
                unique_population.append(GeneticAlgorithmStackingEnsemble.create_individual(param_grid))
        return unique_population

    @staticmethod
    def create_individual(param_grid):
        individual = {}
        for model, params in param_grid.items():
            individual[model] = {key: np.random.choice(values) for key, values in params.items()}
        return individual

    def create_base_models(self, final_population):
        base_models = []
        for individual in final_population:
            for model_name, params in individual.items():
                if model_name == 'RandomForest':
                    model = RandomForestClassifier(**params)
                elif model_name == 'SVM':
                    model = SVC(**params)
                elif model_name == 'DecisionTree':
                    model = DecisionTreeClassifier(**params)
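                elif model_name == 'LogisticRegression':
                    model = LogisticRegression(**params, max_iter=1000, solver='saga')
                else:
                    # Fail loudly instead of silently reusing the previous model
                    raise ValueError(f"Unknown model name: {model_name}")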
                base_models.append((f"{model_name}_{len(base_models)}", model))
        return base_models

    def fit(self, X_train, y_train):
        self.stacking_clf.fit(self.scaler.transform(X_train), y_train)

    def predict(self, X_test):
        return self.stacking_clf.predict(self.scaler.transform(X_test))

    def score(self, X_test, y_test):
        return accuracy_score(y_test, self.predict(X_test))
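
The selection, crossover, and mutation steps used by the class above can be sketched in isolation with the standard library alone. The parameter grid and the `fitness` function here are hypothetical stand-ins (a real run would use a cross-validated score), and the half-survivors scheme is one common choice, not necessarily the best one for this project.

```python
import random

# Hypothetical search space and fitness function, used only for illustration
PARAM_GRID = {"max_depth": [1, 2, 3, 5, 8, 13], "n_estimators": [10, 50, 100, 200]}

def fitness(ind):
    # Stand-in for a cross-validated score: peaks at max_depth=8, n_estimators=100
    return -abs(ind["max_depth"] - 8) - abs(ind["n_estimators"] - 100) / 50

def random_individual(rng):
    return {name: rng.choice(values) for name, values in PARAM_GRID.items()}

def crossover(a, b, rng):
    # Uniform crossover: each parameter comes from either parent with equal probability
    return {name: (a if rng.random() < 0.5 else b)[name] for name in PARAM_GRID}

def mutate(ind, rng):
    # Work on a copy so the parent stays unchanged
    child = dict(ind)
    name = rng.choice(list(PARAM_GRID))
    child[name] = rng.choice(PARAM_GRID[name])
    return child

def evolve(pop_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Rank by fitness and keep the top half unchanged (elitism)
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        survivors = ranked[: pop_size // 2]
        # Refill the population with mutated crossover children of the survivors
        offspring = []
        while len(survivors) + len(offspring) < pop_size:
            a, b = rng.sample(survivors, 2)
            offspring.append(mutate(crossover(a, b, rng), rng))
        population = survivors + offspring
    return max(population, key=fitness), best_per_generation

best, history = evolve()
print(best, history[-1])
```

Because the survivors are carried over unchanged, the best fitness in this sketch can never decrease from one generation to the next.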

In [3]:
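df = pd.read_csv("empl_train.csv")
df = df.dropna()  # drop rows with missing values
# Normalize Cyrillic city labels to their Latin spellings
df["City"] = df["City"].replace({"Ош":"Osh", "Биш":"Bishkek"})
# One-hot encode the categorical columns
df = pd.get_dummies(df,columns=["Education", "City", "Gender", "EverBenched"])
# Repair years typed with the letter "o" instead of zero, then cast to int
df["JoiningYear"] = df["JoiningYear"].replace({"2o22":2022, "2o2o": 2020, "2o17": 2017, "2o21":2021}).astype(int)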
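
The explicit mapping above repairs only the four misspellings it lists. A more general repair is sketched below on a made-up series, under the assumption that the only defect is a lowercase letter "o" typed in place of a zero; any other kind of corruption would still need separate handling.

```python
import pandas as pd

# Toy column with the same kind of typo (values invented for this sketch)
years = pd.Series(["2017", "2o2o", "2o22", 2018, "2o17"])

# Cast to string, swap the letter for the digit, then convert to integers
cleaned = years.astype(str).str.replace("o", "0", regex=False).astype(int)
print(cleaned.tolist())
```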

In [4]:
X = df.copy()
y = X.pop("Leave")
# Fixed seed (arbitrary value) so the stratified split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
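
The cell above produces a train/test split only, while the project requirements also ask for a validation set. One possible way to get all three, sketched on toy arrays (the `_demo` names and the 60/20/20 proportions are assumptions for the example), is to call `train_test_split` twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and labels (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0, 1] * 50)

# First carve off 20% as the test set, then split the remaining 80% 75/25,
# which gives 60% train, 20% validation, 20% test overall
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)
print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```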

In [None]:
# Hyperparameter grids for the base models
param_grid = {
    'RandomForest': {'n_estimators': [10, 50, 100], 'max_depth': [1,2,3,4,5,6,7,8,9,10,12,15,17,20]},
    'SVM': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'DecisionTree': {'max_depth': [1,2,3,4,5,6,7,8,9,10,12,13,15,20]}
}

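# Search on the training split only, so the held-out test set influences
# neither the fitted scaler nor the hyperparameter selection
ensemble = GeneticAlgorithmStackingEnsemble(X_train, y_train, param_grid)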
ensemble.fit(X_train, y_train)
print(f"Stacking Classifier Accuracy: {ensemble.score(X_test, y_test)}")
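
The final cell reports accuracy only, while the project requirements also list precision, recall, and F1-score. The sketch below shows how these relate to the confusion matrix, using hand-made labels and predictions that are illustrative values and not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels and predictions (illustrative values only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels 0/1 the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (tp + tn) / total
    "precision": precision_score(y_true, y_pred),  # tp / (tp + fp)
    "recall": recall_score(y_true, y_pred),        # tp / (tp + fn)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is higher than recall, which is the kind of gap a single accuracy figure would hide.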