# ML models implementation

## 🧠 BASIC MACHINE LEARNING ALGORITHMS FOR DATA SCIENCE


### 🔹 **1. Supervised Learning**

#### 📊 Classification
1. **Logistic Regression**
   - Linear decision boundary, outputs probabilities.
   - Regularization (L1, L2), sigmoid function.

2. **Decision Trees**
   - Interpretable but prone to overfitting; splits are chosen by Gini impurity or entropy (information gain).

3. **Random Forest**
   - Ensemble of trees (bagging), reduces variance, improves generalization.

4. **Gradient Boosting (e.g., XGBoost, LightGBM)**
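   - Sequentially adds weak learners (usually shallow trees), each one fitted to the errors of the current ensemble.
   - Often highly accurate on tabular data; XGBoost and LightGBM handle missing values natively and expose feature importances.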

5. **Support Vector Machines (SVM)**
   - Finds the maximum-margin hyperplane; uses kernels for non-linear data.
   - Parameters: `C`, `gamma`, kernel type.

6. **K-Nearest Neighbors (KNN)**
   - No explicit training phase (lazy learning): the training data is stored and the work happens at prediction time.
   - Distance-based classification (Euclidean, Manhattan).

7. **Naive Bayes**
   - Probabilistic classifier based on Bayes' theorem.
   - Assumes feature independence.
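
The classifiers listed above can be compared side by side with scikit-learn. The following is a minimal sketch on synthetic data, not a benchmark: the dataset, the hyperparameter values, and the use of `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale-sensitive models (logistic regression, SVM, KNN) get a StandardScaler step
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(C=1.0)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(C=1.0, gamma="scale", kernel="rbf")),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive_bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the held-out split
    print(f"{name}: {scores[name]:.3f}")
```

Because every estimator shares the same `fit` / `score` interface, swapping models in and out is a one-line change, which makes this pattern convenient for quick baselines.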

---

#### 📈 Regression
1. **Linear Regression**
   - Models a linear relationship between the features and a continuous target.
   - Evaluate with R², MSE, residual plots.

2. **Ridge & Lasso Regression**
   - Regularization techniques:
     - Ridge (L2): Shrinks coefficients.
     - Lasso (L1): Shrinks and can zero out features (feature selection).

3. **Polynomial Regression**
   - Extends linear regression for curved relationships.

4. **ElasticNet**
   - Mix of L1 and L2 regularization.
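
The difference between the L1 and L2 penalties is easiest to see by comparing fitted coefficients. This is a small sketch on synthetic data where only two of six features carry signal; the `alpha` values are arbitrary choices for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two features influence the target; the other four are pure noise
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=10.0),                           # L2 penalty
    "lasso": Lasso(alpha=0.5),                            # L1 penalty
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, coef in coefs.items():
    print(f"{name:12s}", np.round(coef, 2))
```

Ridge shrinks all coefficients towards zero but keeps them non-zero, while Lasso sets the noise-feature coefficients to exactly zero, which is why it doubles as a feature selector.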

---

### 🔹 **2. Unsupervised Learning**

#### 🧩 Clustering
1. **K-Means Clustering**
   - Partitions data into k clusters.
   - Distance-based, sensitive to initialization.

2. **Hierarchical Clustering**
   - Builds a dendrogram; doesn't require specifying k up front.

3. **DBSCAN**
   - Density-based clustering, handles noise and clusters of varying shapes.
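
The three clustering approaches above can be tried on the same toy dataset. The sketch below uses scikit-learn's two-moons generator; the `eps`, `min_samples`, and linkage settings are assumptions chosen for this particular dataset and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape that centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN does not take k; it labels noise points as -1
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))

print("k-means labels:", np.unique(kmeans_labels))
print("hierarchical labels:", np.unique(hier_labels))
print("DBSCAN clusters found:", n_dbscan_clusters, "| noise points:", n_noise)
```

Plotting `X` coloured by each label array is a quick way to see how the density-based result follows the moon shapes while k-means cuts the data with a straight boundary.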

---

#### 📉 Dimensionality Reduction
1. **PCA (Principal Component Analysis)**
   - Projects data onto the directions of greatest variance, keeping as much variance as possible in fewer dimensions.

2. **t-SNE / UMAP**
   - Good for visualization of high-dimensional data.

3. **Truncated SVD**
   - For sparse matrices, used with text data (e.g., TF-IDF).
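
As a concrete example of dimensionality reduction, the sketch below projects the four-feature Iris dataset onto two principal components and reports how much variance they retain. Standardizing first is an assumption made here because PCA is sensitive to feature scale.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # 150 samples x 4 features
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("reduced shape:", X_2d.shape)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("total variance kept:", round(float(pca.explained_variance_ratio_.sum()), 3))
```

The `explained_variance_ratio_` attribute is what you would plot (cumulatively) to decide how many components to keep.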

---

### 🔹 **3. Time Series & Sequence Modeling**
1. **ARIMA / SARIMA**
   - Classical forecasting methods (trend, seasonality).

2. **Exponential Smoothing**
   - Weighted average of past observations, with weights that decay exponentially for older points.

3. **Recurrent Neural Networks (RNNs), LSTM, GRU**
   - Deep learning models for sequences (if deep learning is expected).
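
Simple exponential smoothing is short enough to write by hand, which helps when explaining it. The function below is a hypothetical helper (not a library API) that applies the recursion s_t = alpha * x_t + (1 - alpha) * s_{t-1}, initialised with the first observation; the input values are made up.

```python
import numpy as np


def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    series = np.asarray(series, dtype=float)
    smoothed = np.empty_like(series)
    smoothed[0] = series[0]  # initialise with the first observation
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed


observations = [10, 12, 11, 15, 14]  # made-up values for illustration
smoothed = simple_exponential_smoothing(observations, alpha=0.5)
print(smoothed)  # [10.  11.  11.  13.  13.5]
```

A larger `alpha` tracks recent observations more closely; a smaller one produces a smoother, slower-reacting series.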

---

### 🔹 **4. Other Important Techniques**
- **Cross Validation** (k-fold, stratified)
- **GridSearchCV / RandomizedSearchCV** for hyperparameter tuning
- **Feature Selection** (Univariate, Recursive Feature Elimination)
- **Handling Imbalanced Data** (SMOTE, class weights)
- **Evaluation Metrics:**
  - Classification: accuracy, precision, recall, F1, ROC-AUC
  - Regression: MSE, RMSE, MAE, R²
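
Several of these techniques are typically combined in one workflow. The sketch below tunes a logistic regression with stratified k-fold cross-validation and `GridSearchCV`, uses class weights as one way to handle imbalance, and reports classification metrics on a held-out split. The dataset and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=cv,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]  # probability of the positive class
test_auc = roc_auc_score(y_test, proba)

print("best C:", search.best_params_["clf__C"])
print("test ROC-AUC:", round(test_auc, 3))
print(classification_report(y_test, search.predict(X_test)))
```

Keeping preprocessing inside the `Pipeline` avoids leaking test-fold statistics into training, which is a common interview talking point.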

---

### 🧪 How to Prepare for Each
| Algorithm                | What to Know | Practice |
|-------------------------|--------------|----------|
| Logistic Regression     | Math, sigmoid, loss function | `sklearn.linear_model.LogisticRegression` |
| Random Forest           | Ensemble concept, overfitting control | `RandomForestClassifier`, plot feature importances |
| Gradient Boosting       | Sequential learning, shrinkage | `XGBoost`, `LightGBM` |
| SVM                     | Margin, kernels | `SVC` with different kernels |
| KNN                     | Distance metrics | `KNeighborsClassifier` |
| Linear Regression       | Cost function, assumptions | `LinearRegression`, plot residuals |
| PCA                     | Eigenvectors, explained variance | `PCA` from `sklearn.decomposition` |
| K-Means                 | Elbow method, silhouette score | `KMeans`, cluster plots |

---

## **Python libraries** 

### ✅ **Core Python Libraries You Should Know**

### 🔢 **Numerical & Matrix Operations**
- **`NumPy`**
  - Array creation, indexing/slicing, broadcasting
  - Matrix multiplication (`dot`, `matmul`), reshaping, linear algebra

- **`SciPy`**
  - `scipy.stats` for statistical testing (e.g., t-tests, p-values)
  - Optimization and numerical integration
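
A few of the NumPy and SciPy operations named above, in one short sketch. The arrays and the two synthetic samples are made up for illustration.

```python
import numpy as np
from scipy import stats

# Broadcasting: a (3, 1) column combined with a (4,) row yields a (3, 4) grid
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col + row
print("grid shape:", grid.shape)

# Linear algebra: solve the system A x = b
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print("solution:", x)

# Two-sample t-test on synthetic samples drawn with different means
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.5, scale=1.0, size=200)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", t_stat, "| p-value:", p_value)
```

`np.linalg.solve` is generally preferred over computing an explicit inverse with `np.linalg.inv(A) @ b`, since it is faster and numerically more stable.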

---

### 🐼 **Data Wrangling & EDA**
- **`pandas`**
  - DataFrames, Series, filtering, merging, `groupby`, missing value handling
  - `pivot`, `melt`, `.apply()`, `.map()`, `.value_counts()`

- **`matplotlib` & `seaborn`**
  - Plotting histograms, barplots, boxplots, pairplots, heatmaps
  - EDA visualizations (correlations, distributions)
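
The pandas operations listed above chain together naturally. Here is a small sketch using two made-up tables (`orders` and `customers` are hypothetical names) that fills a missing value, merges, aggregates with `groupby`, and reshapes with `melt`.

```python
import pandas as pd

# Small made-up tables for illustration
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [20.0, 35.0, None, 15.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Missing value handling: fill the missing amount with the column median
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Merge (SQL-style left join), then aggregate per region
merged = orders.merge(customers, on="customer_id", how="left")
by_region = merged.groupby("region")["amount"].agg(["sum", "mean"])
print(by_region)
print(merged["region"].value_counts())

# Reshape wide -> long with melt
wide = pd.DataFrame({"id": [1, 2], "q1": [5, 7], "q2": [6, 8]})
long = wide.melt(id_vars="id", var_name="quarter", value_name="score")
print(long)
```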

---

### 🧠 **Machine Learning & Model Evaluation**
- **`scikit-learn`**
  - Models: LogisticRegression, RandomForest, SVC, KMeans, PCA
  - Pipelines, `train_test_split`, `cross_val_score`, `GridSearchCV`
  - Preprocessing: `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`
  - Metrics: `accuracy_score`, `confusion_matrix`, `roc_auc_score`

- **`xgboost`**
  - Gradient boosting classifier and regressor
  - Feature importance; `DMatrix`, XGBoost's internal data structure optimized for memory use and training speed

- **`lightgbm`** *(optional, but nice to know)*
  - Another gradient boosting framework (often faster and lighter on memory than XGBoost on large datasets)
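
The preprocessing classes above are usually wired into a pipeline so that numeric and categorical columns are handled in one object. The sketch below uses a tiny made-up churn table (column names and values are hypothetical) and `ColumnTransformer`, which is not listed above but is the standard scikit-learn tool for per-column preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up dataset for illustration only
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 62, 41, 38, 26, 55, 33, 48],
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro",
             "basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})
features = df[["age", "plan"]]
target = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=0
)

# Scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true class, columns: predicted
print("accuracy:", acc)
print(cm)
```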

---

### 📊 **Text Processing & Feature Engineering**
- **`sklearn.feature_extraction.text.TfidfVectorizer`**
  - Convert raw text to TF-IDF feature matrix
- **`nltk` or `spaCy`** *(if text data is involved)*
  - Tokenization, stopwords, stemming/lemmatization

---

### 🧮 **Deep Learning (maybe optional)**
- **`tensorflow` / `keras`** or **`torch`**
  - If they ask about ResNet, activation functions, vanishing gradients
  - Know basic model building syntax if deep learning comes up

---

### 🧪 **Other Useful Utilities**
- **`statsmodels`**
  - For statistical modeling (e.g., linear regression, p-values)

- **`sqlalchemy`** or **`sqlite3`**
  - If they bring up SQL with Python integration

- **`langchain`** *(mentioned in the job description)*
  - May come up if they ask about LLMs or generative AI pipelines (you just need to know what it does)

---

### 🔍 Your Must-Know Set for Interview
If you want to keep it tight and efficient, prioritize:
```text
numpy
pandas
scipy
matplotlib
seaborn
scikit-learn
xgboost
statsmodels
```

---

# Library Practice

### Matrix reshaping and manipulations

```python
# NumPy provides fast array and matrix operations
import numpy as np

a = np.array([1, 2, 3])          # 1-D array with 3 elements
b = np.array([[2, 3], [1, 6]])   # 2x2 matrix: 2 rows x 2 columns

print('a.shape:', a.shape)
print('b.shape:', b.shape)

print('b array:\n', b)
c = b.reshape(1, 4)  # reshaped to [[2, 3, 1, 6]]: 1 row x 4 columns
print('b.reshaped array:\n', c)
print('b.reshaped array shape:', c.shape)
```

```text
a.shape: (3,)
b.shape: (2, 2)
b array:
 [[2 3]
 [1 6]]
b.reshaped array:
 [[2 3 1 6]]
b.reshaped array shape: (1, 4)
```


### Stats of arrays 

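```python
# Uses `np` and the 1-D array `a` = [1, 2, 3] from the previous cell
d = np.dot(a, a)  # dot product of a with itself: 1*1 + 2*2 + 3*3
print('a dot product:', d)

d = np.mean(a)
print('mean of array a:', d)

d = np.std(a)  # population standard deviation (divides by n)
print('Std of array a:', d)
```

```text
a dot product: 14
mean of array a: 2.0
Std of array a: 0.816496580927726
```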
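
One detail worth knowing about the standard deviation printed above: `np.std` defaults to the population formula (`ddof=0`), whereas pandas' `Series.std` and most statistics textbooks use the sample formula (`ddof=1`). A minimal sketch of the difference on the same array:

```python
import numpy as np

a = np.array([1, 2, 3])

population_std = np.std(a)       # ddof=0 (default): divide by n
sample_std = np.std(a, ddof=1)   # ddof=1: divide by n - 1

print('population std:', population_std)
print('sample std:', sample_std)
```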


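### Matrix Multiplication
Remember: in matrix multiplication, the number of columns of matrix 1 must equal the number of rows of matrix 2.

```python
# np.dot(a, b) with the earlier a (3 elements) and b (2 rows) would fail: 3 != 2.
# Note that a * b is element-wise multiplication, not matrix multiplication.
a = np.array([[2, 3, 4], [5, 3, 2]])     # 2 rows x 3 columns
b = np.array([[2, 3], [5, 4], [3, 1]])   # 3 rows x 2 columns
print(a.shape)
print(b.shape)

# Unpack the dimensions to compare the inner sizes (a_cols must equal b_rows)
a_rows, a_cols = a.shape
b_rows, b_cols = b.shape
```

```text
(2, 3)
(3, 2)
```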
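
To finish the shape check above, here is a minimal sketch that verifies the inner dimensions, multiplies the two matrices, and shows the error NumPy raises when the shapes do not line up. The variable names `a_rows`, `a_cols`, `b_rows`, `b_cols` are illustrative.

```python
import numpy as np

a = np.array([[2, 3, 4], [5, 3, 2]])     # shape (2, 3)
b = np.array([[2, 3], [5, 4], [3, 1]])   # shape (3, 2)

a_rows, a_cols = a.shape
b_rows, b_cols = b.shape
assert a_cols == b_rows, "inner dimensions must match"

c = a @ b          # equivalent to np.matmul(a, b)
print(c.shape)     # (a_rows, b_cols) -> (2, 2)
print(c)           # [[31 22]
                   #  [31 29]]

try:
    b @ b          # (3, 2) @ (3, 2): 2 columns vs 3 rows
except ValueError as err:
    print("shape mismatch:", err)
```

The result always has shape `(rows of matrix 1, columns of matrix 2)`, so a `(2, 3)` by `(3, 2)` product gives a `(2, 2)` matrix.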
