# Essential Knowledge for LLMs, Data Science, and Non-Reinforcement Learning ML

This notebook covers fundamental concepts and implementations for Large Language Models (LLMs), Data Science, and non-Reinforcement Learning Machine Learning techniques. It is designed to provide a solid foundation for roles in data science, NLP, and general machine learning development.

## Table of Contents
1. **Large Language Models (LLMs)**
    - Overview of LLMs and Transformer Architectures
    - Tokenization and Embeddings
    - Building and Fine-Tuning with the `transformers` library
    - Text Generation and Sentiment Analysis
2. **Data Science Essentials**
    - Data Preprocessing and Feature Engineering
    - Exploratory Data Analysis (EDA)
    - Supervised and Unsupervised Learning Techniques
3. **Non-Reinforcement Learning Machine Learning**
    - Time Series Analysis and Forecasting
    - Anomaly Detection
    - Transfer Learning with Pre-trained Models


## 1. Large Language Models (LLMs)

### 1.1 Overview of LLMs and Transformer Architectures
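Large Language Models (LLMs) are deep learning models trained on massive text datasets, typically to predict the next token or to reconstruct masked tokens. They are built on the Transformer architecture, which uses attention mechanisms to capture relationships between tokens in a sequence, regardless of how far apart those tokens are.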

**Examples**:
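- **GPT (Generative Pre-trained Transformer)**: A decoder-only model trained to predict the next token. It generates text and, with prompting or fine-tuning, can answer questions and perform many other NLP tasks.
- **BERT (Bidirectional Encoder Representations from Transformers)**: An encoder-only model trained with masked language modeling, so each token's representation draws on both its left and right context. It is designed for understanding tasks such as classification rather than for text generation.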

**Key Components of Transformers**:
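- **Self-Attention Mechanism**: For each token, computes a weight for every token in the sequence and uses those weights to build a context-aware representation.
- **Positional Encoding**: Adds information about each token's position, because self-attention on its own is insensitive to word order.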
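
The two components above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not how any particular library implements it: the random matrices `w_q`, `w_k`, and `w_v` stand in for learned projection weights, `token_embeddings` stands in for learned embeddings, and the sinusoidal scheme is one common choice of positional encoding (BERT, for instance, learns its position embeddings instead).

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                 # even dimensions
    encoding[:, 1::2] = np.cos(angles)                 # odd dimensions
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise token similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for learned embeddings
x = token_embeddings + sinusoidal_positions(seq_len, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)           # one context-aware vector per token
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
```

Row `i` of `weights` shows how much token `i` attends to every token in the sequence, which is the "importance with respect to each other" described above.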


In [None]:
# 1.2 Tokenization and Embeddings

# Using the Hugging Face Transformers library to tokenize text and create embeddings

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample sentence
sentence = "Machine learning is fascinating!"

# Tokenize the sentence
inputs = tokenizer(sentence, return_tensors='pt')

# Get embeddings from BERT
with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state of the model (embeddings for each token)
embeddings = outputs.last_hidden_state
print(f"Shape of embeddings: {embeddings.shape}")
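
BERT's tokenizer splits words into subword pieces using WordPiece. The idea can be sketched as a greedy longest-match-first search, shown below in plain Python. The `toy_vocab` set and the `wordpiece_tokenize` helper are hypothetical and exist only for illustration; the real `bert-base-uncased` vocabulary has roughly 30,000 entries and will split words differently, and the real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"machine", "learn", "##ing", "fasc", "##ina", "##ting", "is"}

for word in ["machine", "learning", "fascinating", "is", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Each resulting piece maps to an integer ID, and the model produces one embedding vector per piece. This is why the sequence dimension of `embeddings.shape` can be larger than the number of words in the sentence.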


### 1.3 Building and Fine-Tuning with the `transformers` library
We can fine-tune a pre-trained model like BERT or GPT-2 for specific tasks such as text classification or text generation.

**Example**: Fine-tuning BERT for sentiment analysis.


In [None]:
# Example code for fine-tuning a pre-trained BERT model for sentiment analysis

import torch
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import pandas as pd

# Note: `tokenizer` is the BertTokenizer loaded in the tokenization cell above

# Load a sample dataset for sentiment analysis
data = {'text': ["I love machine learning!", "This is a boring task.", "Deep learning is amazing."],
        'label': [1, 0, 1]}
df = pd.DataFrame(data)

# Train-test split
train_texts, val_texts, train_labels, val_labels = train_test_split(df['text'], df['label'], test_size=0.2)

# Tokenize the text data
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, return_tensors='pt')
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, return_tensors='pt')

# Create PyTorch datasets
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

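# Convert labels to plain lists: after the split, each pandas Series keeps its
# original index, so positional lookups like labels[idx] could raise a KeyError
train_dataset = SentimentDataset(train_encodings, train_labels.tolist())
val_dataset = SentimentDataset(val_encodings, val_labels.tolist())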

# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train and evaluate
trainer.train()
trainer.evaluate()
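
By default, `trainer.evaluate()` reports the evaluation loss but no task metric. One way to add accuracy is a `compute_metrics` function, sketched below. The function name and the stand-in arrays `fake_logits` and `fake_labels` are illustrative assumptions; the sketch relies on the `Trainer` passing an object that unpacks into logits and label IDs.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn raw logits into class predictions and report accuracy."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # index of the highest-scoring class
    return {"accuracy": float((predictions == labels).mean())}

# Stand-in values with the shapes expected at evaluation time (3 examples, 2 classes)
fake_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]])
fake_labels = np.array([1, 0, 0])
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. With only three training sentences, any accuracy figure from this example is a smoke test of the pipeline rather than a meaningful measurement.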


## 2. Data Science Essentials

### 2.1 Data Preprocessing and Feature Engineering
Data preprocessing is a critical step in data science. It includes handling missing values, scaling numerical features, and encoding categorical variables.

**Example**: Using `pandas` and `scikit-learn` for basic data preprocessing tasks.


In [None]:
# Basic Data Preprocessing with Pandas and Scikit-learn

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create a sample dataframe
data = {'age': [25, 32, 47, 51, 23, np.nan],
        'salary': [50000, 54000, 58000, 60000, 52000, 59000],
        'department': ['HR', 'Engineering', 'Engineering', 'HR', 'HR', 'Engineering']}

df = pd.DataFrame(data)

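# Handle missing values (assign the result back; calling fillna with
# inplace=True on a selected column is deprecated in recent pandas versions)
df['age'] = df['age'].fillna(df['age'].mean())

# One-hot encode categorical variables
# (`sparse_output` replaced the `sparse` argument in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)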
department_encoded = encoder.fit_transform(df[['department']])

# Standardize numerical features
scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

# Concatenate the encoded and scaled features
encoded_df = pd.DataFrame(department_encoded, columns=encoder.get_feature_names_out(['department']))
df.reset_index(drop=True, inplace=True)
final_df = pd.concat([df, encoded_df], axis=1).drop(['department'], axis=1)

print(final_df)
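
The cell above computes the imputation mean and the scaling statistics from every row. When a model is later evaluated on held-out data, those statistics should come from the training rows only. The sketch below shows one way to arrange that with scikit-learn's `Pipeline` and `ColumnTransformer`. The data values, the fixed 4/2 split, and the step names (`impute`, `scale`, `num`, `cat`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data in the same shape as the example above (values are illustrative)
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 23, 47],
    "salary": [50000, 54000, 58000, 60000, 52000, 59000],
    "department": ["HR", "Engineering", "Engineering", "HR", "HR", "Engineering"],
})
train, test = raw.iloc[:4], raw.iloc[4:]   # fixed split keeps the sketch deterministic

# Numeric columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric, ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
    ],
    sparse_threshold=0,   # always return a dense array
)

X_train = preprocess.fit_transform(train)  # statistics come from training rows only
X_test = preprocess.transform(test)        # reuses the training mean, std, and categories
print(X_train.shape, X_test.shape)
```

Bundling the steps this way also means the same fitted object can be applied unchanged to new data at prediction time.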


## 3. Non-Reinforcement Learning Machine Learning

### 3.1 Time Series Analysis and Forecasting
Time series analysis involves techniques for analyzing time-ordered data points to forecast future values.

**Example**: Building a Long Short-Term Memory (LSTM) network for time series forecasting using TensorFlow/Keras.


In [None]:
# Time Series Forecasting with LSTM using TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import numpy as np
import matplotlib.pyplot as plt

# Generate a synthetic time series dataset
def create_dataset(n_points=1000):
    time = np.arange(0, n_points)
    data = np.sin(0.1 * time) + np.random.normal(0, 0.1, size=n_points)  # Sine wave with noise
    return data

# Create dataset
data = create_dataset()
n_timesteps = 50

# Prepare the dataset for LSTM
X, y = [], []
for i in range(len(data) - n_timesteps):
    X.append(data[i:i+n_timesteps])
    y.append(data[i+n_timesteps])
X, y = np.array(X), np.array(y)

# Split into train and test sets
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Build LSTM model
model = Sequential([
    LSTM(50, activation='relu', input_shape=(n_timesteps, 1)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Reshape input data for LSTM
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

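# Train the model (the held-out test split doubles as validation data here,
# so the reported test loss is not a fully independent estimate)
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))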

# Evaluate and predict
loss = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss}')
y_pred = model.predict(X_test)

# Visualize results
plt.figure(figsize=(10, 6))
plt.plot(range(len(y_test)), y_test, label='Actual')
plt.plot(range(len(y_pred)), y_pred, label='Predicted')
plt.title('LSTM Time Series Forecasting')
plt.xlabel('Time Steps')
plt.ylabel('Value')
plt.legend()
plt.show()
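
A test loss is easier to interpret next to a simple baseline. The sketch below computes the mean squared error of a persistence forecast, which predicts that the next value equals the last observed one. It regenerates a series of the same form as the cell above with its own random seed (an assumption made so the block runs standalone), so the exact number will differ from a baseline computed on the notebook's `data` array.

```python
import numpy as np

rng = np.random.default_rng(42)
time = np.arange(1000)
series = np.sin(0.1 * time) + rng.normal(0, 0.1, size=time.size)  # sine wave with noise

# Build the same sliding windows as the LSTM example
n_timesteps = 50
X = np.array([series[i:i + n_timesteps] for i in range(len(series) - n_timesteps)])
y = series[n_timesteps:]

# Same chronological 80/20 split
train_size = int(len(X) * 0.8)
X_test, y_test = X[train_size:], y[train_size:]

# Persistence forecast: the prediction is the last value in each input window
y_naive = X_test[:, -1]
naive_mse = float(np.mean((y_test - y_naive) ** 2))
print(f"Persistence baseline MSE: {naive_mse:.4f}")
```

If the LSTM's test loss is not clearly below this baseline, the network has learned little beyond copying the most recent value.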
