# Object-Oriented Programming
1.) Encapsulation = Bundling attributes and the methods that act on them into a class, and hiding the internal details behind that class's public interface
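
A minimal sketch of encapsulation, assuming a made-up `BankAccount` class (not from these notes): the balance is kept internal and only changed through a method that checks the input.

```python
class BankAccount:
    """Hypothetical example class: bundles a balance with the methods that manage it."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance  # leading underscore = "internal" by convention

    @property
    def balance(self):
        # Read-only view of the internal state
        return self._balance

    def deposit(self, amount):
        # The only supported way to change the balance, so the check always runs
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self._balance += amount


account = BankAccount("Ada", 100)
account.deposit(50)
print(account.balance)  # Output: 150
```

Python doesn't enforce privacy; the underscore is only a convention telling callers to use `deposit()` and `balance` instead.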

2.) Abstraction = Exposing WHAT an object does while hiding HOW it does it (ex; an abstract base class declares a method, and every subclass must implement it its own way)

In [None]:
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14 * self.radius ** 2

3.) Inheritance = A child class gets the attributes and methods of its parent class

In [None]:
class Animal:
    def speak(self):
        print("Animal speaks")

class Dog(Animal):
    def bark(self):
        print("Dog barks") # Dog borrows speak() from Animal

4.) Polymorphism = Treating different objects the same way, as long as they share the same method names

In [None]:
class Animal:
    def speak(self):
        pass

class Dog(Animal):
    def speak(self):
        print("Dog barks")

class Cat(Animal):
    def speak(self):
        print("Cat meows")

def make_animal_speak(animal):
    animal.speak()

dog = Dog()
cat = Cat()

make_animal_speak(dog)  # Output: Dog barks
make_animal_speak(cat)  # Output: Cat meows

# .gitignore
Git skips any untracked file matching the patterns listed inside this file (ex; .exe files, anything in /vendor)

# Interpreter
Program that reads your code, translates it into instructions the computer can execute, runs them, and raises errors when the code is invalid

# Scope
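- Scope = where a name is visible; separate scopes avoid name collisions
- (L)ocal = Only available inside the function where it's defined
- (E)nclosing = With nested functions, names from the outer function are available to the inner one
- (G)lobal = Available to all code in the module
- (B)uilt-in = Names that ship with Python (ex; len, max)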

In [None]:
# Global:
global_var = "I am global"

def my_function():
    # Local:
    local_var = "I am local"
    print(local_var)  # Accessing local

    # An inner function defined here could access local_var (its enclosing scope)

    # Accessing global
    print(global_var)

    # Accessing built-in
    print(len("Hello"))  # 'len' is built-in

# Accessing built-in function
print(max(3, 5))  # 'max' is a built-in function
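
The enclosing scope is only described in a comment above, so here is a small sketch of it. The function names are made up for illustration; `nonlocal` is the keyword that lets the inner function rebind a name from the outer one.

```python
def make_counter():
    count = 0  # local to make_counter, "enclosing" for increment

    def increment():
        nonlocal count  # rebind the enclosing variable instead of creating a new local
        count += 1
        return count

    return increment


counter = make_counter()
print(counter())  # Output: 1
print(counter())  # Output: 2
```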

# Python List
- my_list = [1, "a", 3.2]
- Add / remove / slice items
- Store different types
- Slower for numeric computation, especially on large data

# NumPy Array
- np.array([1, 2, 3, 4, 5])
- FIXED size
- Same data type
- Fast computation
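
A quick sketch of how the two behave differently, assuming NumPy is installed. The values are made up.

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array([1, 2, 3])

print(my_list * 2)   # Output: [1, 2, 3, 1, 2, 3]  (repeats the list)
print(my_array * 2)  # Output: [2 4 6]  (multiplies every element)

# Mixed numbers get converted so every item shares one data type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # Output: float64
```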

# Data Types:
int = Whole numbers (e.g., 5, -3)

float = Decimals (3.14, -0.5)

bool = True/False

str = Text ("hello", 'python')

list = Ordered collection ([1, 2, 2])

tuple = IMMUTABLE list ((1, 2, 2))

dict = Collection of key-value pairs ({'name': 'John', 'age': 30})

set = Unordered collection of UNIQUE items ({'apple', 'banana'})

None = Absence of a value (its own type, NoneType); not the same as NaN, which is a float
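
None and NaN are easy to mix up, so here is a small sketch of the difference using only the standard library. The variable names are made up.

```python
import math

missing = None           # no value at all
not_a_number = math.nan  # a float standing in for an undefined numeric result

print(type(missing).__name__)        # Output: NoneType
print(type(not_a_number).__name__)   # Output: float
print(missing is None)               # Output: True
print(not_a_number == not_a_number)  # Output: False (NaN never equals anything, even itself)
print(math.isnan(not_a_number))      # Output: True
```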

# Class:

In [None]:
# Declaration
class Dog:
    pass

# Class attribute = variable shared by every instance of the class
# Method = function every instance of the class can call
# (both live in ONE class body; redefining Dog would throw away the earlier version)
class Dog:
    species = "Canine"

    def bark(self):
        return "Woof!"

# Instance = one specific object built from the class
my_dog = Dog()
print(my_dog.species)  # Output: Canine
print(my_dog.bark())  # Output: Woof!

# Initialization = __init__ runs when an instance is created and stores the inputs as that instance's attributes
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

my_dog = Dog("Buddy", 3)
print(my_dog.name)  # Output: Buddy
print(my_dog.age)   # Output: 3

# Truthiness
1. Falsy = Evaluates to False in a boolean context (ex; False, None, [], {}, "", set(), 0, 0.0)
2. Truthy = Evaluates to True in a boolean context (ex; everything else)

In [None]:
if 1:
    print("1 is truthy")  # Output: 1 is truthy

if "hello":
    print("hello is truthy")  # Output: hello is truthy

if 0:
    print("0 is truthy")
else:
    print("0 is falsy")  # Output: 0 is falsy

if []:
    print("Empty list is truthy")
else:
    print("Empty list is falsy")  # Output: Empty list is falsy

# Central Tendency
- Mean (pulled toward outliers)
- Median (less affected by outliers)
- Mode (also less affected by outliers)

Spread (not central tendency, but goes with it):
- Quartiles = data split into 4 parts; IQR = Q3 - Q1; outlier cutoffs = Q1 - 1.5*IQR and Q3 + 1.5*IQR
- Standard deviation = typical distance of values from the mean


Symmetrical = Mean / Median / Mode (on top of one another)

Positive skew = Mode --> Median --> Mean (mean dragged right)

Negative skew = Mean --> Median --> Mode (mean dragged left)

Replace null values with the mean when data is normally distributed / with the median when data is skewed
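
A small sketch of these measures using Python's built-in `statistics` module. The sample is made up, and the outlier check uses the usual 1.5 × IQR rule. Quartile values depend on the method used; this relies on the module's default ("exclusive").

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 40]  # made-up sample with one large outlier

mean = statistics.mean(data)      # 9, dragged up by the 40
median = statistics.median(data)  # 4, barely notices the 40
mode = statistics.mode(data)      # 3, the most common value
spread = statistics.stdev(data)   # sample standard deviation

# Quartiles and the 1.5 * IQR outlier fences
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(outliers)  # Output: [40]
```

Here mean > median > mode, which matches the positive-skew ordering.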

# Correlation Coefficient

1. Correlation Coefficient = r
   - r = 1 = perfect positive
   - r = -1 = perfect negative
   - r = 0 = no linear relationship

2. Coefficient of Determination = r^2
   - Proportion of variance in the dependent variable explained by the model
   - r^2 = 0 = NONE of variability in dependent explained
   - r^2 = 1 = ALL variability in dependent explained
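
A minimal sketch of computing r by hand, with made-up data and a hypothetical helper name `pearson_r`. For a simple one-input linear fit, squaring r gives the coefficient of determination.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
scores = [1, 3, 2, 5, 4]  # made-up, roughly increasing

r = pearson_r(hours, scores)
print(round(r, 2))       # Output: 0.8
print(round(r ** 2, 2))  # Output: 0.64
```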

# Random Variables
Discrete = Countable values, usually whole numbers (ex; number of kids)

Continuous = Any value in a range, can always get more fractional (ex; height)

Probability Density Function (PDF) = Curve for a CONTINUOUS variable; area under it over a range = probability the variable falls in that range

Cumulative Distribution Function (CDF) = Probability variable is less than or equal to a certain value

Normal Distribution = Bell Curve
- Central Limit Theorem = Take lots of samples, plot their means, and that plot approaches a bell curve as sample size grows (even if the population isn't normal)
- Law of Large Nums = The bigger the sample, the closer its mean gets to the population mean

Random State / Random Seed = Fixes the starting point of the computer's pseudo-random generator, so the same "random" results come out every run
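
A short sketch of seeding and of the Law of Large Numbers, using the standard `random` module and simulated die rolls. The seed values 42 and 0 are arbitrary picks.

```python
import random

# Same seed -> same "random" sequence
random.seed(42)
first_run = [random.randint(1, 6) for _ in range(5)]

random.seed(42)
second_run = [random.randint(1, 6) for _ in range(5)]

print(first_run == second_run)  # Output: True

# Law of Large Numbers: a fair die has a population mean of 3.5
rng = random.Random(0)
small_sample_mean = sum(rng.randint(1, 6) for _ in range(10)) / 10
big_sample_mean = sum(rng.randint(1, 6) for _ in range(100_000)) / 100_000
print(small_sample_mean, big_sample_mean)  # the big sample should land much closer to 3.5
```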

# Scales of Measurement
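- Nominal = Categories with no order (ex; eye color / diff fruits)
- Ordinal = Ordered, but gaps between values aren't equal (ex; rankings 1, 2, 3, 4 / grades A, B, C, D, F)
- Interval = Ordered with equal gaps, no true 0 (ex; temperature in °C or °F)
- Ratio = Ordered with equal gaps and a true 0 (ex; height, weight)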

# St(andar)din(put) and St(andar)dout(put)
1. Stdin = Where a program reads its input from (ex; keyboard, a file piped in)
2. Stdout = Where a program writes its output to (ex; monitor, a file it's redirected to)
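
A tiny sketch of a program that reads from stdin and writes to stdout. The function name `shout` is made up, and an `io.StringIO` stands in for keyboard input so the example runs without typing anything.

```python
import io
import sys


def shout(source=sys.stdin, sink=sys.stdout):
    """Read lines from source and write them upper-cased to sink."""
    for line in source:
        sink.write(line.upper())


# Stand-in for keyboard input
fake_input = io.StringIO("hello\nworld\n")
shout(source=fake_input)  # writes HELLO and WORLD to stdout
```

Calling `shout()` with no arguments would read from the real keyboard (or a piped file) instead.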

# Cleaning

- Handling Missing Values - ex; df['age'] = df['age'].fillna(df['age'].mean()) fills missing age values with the mean age
- Removing Duplicates - ex; df.drop_duplicates(inplace=True) removes duplicate rows from the DataFrame
- Handling Outliers - ex; upper_threshold = df['income'].quantile(0.99) calculates the 99th percentile of income as a cutoff
- Standardizing Data - ex; df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d') converts dates to the YYYY-MM-DD format
- Correcting Inconsistent Values - ex; df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female'}) replaces 'M' with 'Male' and 'F' with 'Female' in ['gender']
- Normalization and Scaling - ex; scaler = MinMaxScaler(); df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']]) scales age and income to the 0 --> 1 range
- Feature Engineering - ex; df['day_of_week'] = df['date'].dt.day_name() extracts day of the week from ['date'] (needs a datetime column)
- Handling Irrelevant Features - ex; df.drop(['unnecessary_column'], axis=1, inplace=True) drops 'unnecessary_column' from the DataFrame
- Data Transformation - ex; df['log_income'] = np.log(df['income']) computes natural logs of the income column
- Documenting Changes - ex; separate doc or file detailing the data cleaning steps taken / why

# Preprocessing

1.) pd.read_csv('file')

2.) Clean data (above)

In [None]:
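from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# `data` is the cleaned DataFrame from steps 1-2

# 3.) Scale data (in real projects, fit the scaler on the training split only so no test info leaks in):
scaler = StandardScaler()
data[['feature1', 'feature2']] = scaler.fit_transform(data[['feature1', 'feature2']])

# 4.) Feature encoding (LabelEncoder is meant for targets; OneHotEncoder / OrdinalEncoder are the usual picks for input features)
label_encoder = LabelEncoder()
data['category_encoded'] = label_encoder.fit_transform(data['category'])

# 5.) Train-test-split (select features CAREFULLY)
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)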

# Machine Learning
Models trained on data, to predict similar, unknown data

Supervised = Trained on labeled data (known answers to check against)
- Classification (ex; diagnosing, image IDing)
- Regression (ex; life expectancy, weather forecasting)

Unsupervised = No labels, model finds structure on its own
- Dimensionality Reduction (ex; big data visualization)
- Clustering (ex; targeted marketing / customer segmentation)

Reinforcement = Learns by trial and error, guided by real-time rewards
- ex; that one F1 racing track vid

Cross-Validation = Testing the model on data it wasn't trained on, could take many forms
- Holdout = one train/test split
- K-fold = "k" different splits, each fold takes a turn as the test set
- Leave-one-out = each datapoint is its own test set
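
A rough sketch of how the splits can be built by hand, with no shuffling and a made-up helper name `k_fold_indices`. Libraries like scikit-learn have their own splitters; this only shows the idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=6, k=3):
    print(train, test)
# Output:
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Setting k equal to the number of samples turns this into leave-one-out.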

Feature Selection = Choosing inputs that predict output the best
- Use a correlation matrix to see r between each feature and the target
- DON'T take features correlated strongly with e/o (multicollinearity)

# Over/underfitting
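Overfitting = TOO good on the training dataset, bad on others
- Learns the noise / quirks of the original data, not the general pattern
- Too many features, remove them (or train on more data)

Underfitting = Bad on every dataset, model too simple to catch the pattern
- Add features
- Use a more complex model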

# Complete Separation:
One variable perfectly predicts the outcome no matter what (eg; under this age will always buy toy)
- Indicated by extreme / infinite coefficient estimates for that one var
- Usually w/ small data / really obvious connection (car type predicting having a car at all)
- Increase sample size / use random forests (lots of random decision trees w/ diff splits)

# Accuracy, Precision, Recall, ConfusionMatrix
True Positive = Had cancer, diagnosed

True Negative = No cancer, not diagnosed

False Positive = No cancer, diagnosed

False Negative = Had cancer, not diagnosed
- Accuracy = Correct Predictions / All Predictions = (TP + TN) / (TP + TN + FP + FN)
- Precision = Correct Yes Predictions / All Yes Predictions = (TP) / (TP + FP)
- Recall = Correct Yes Predictions / All Yes Actual = (TP) / (TP + FN)

ConfusionMatrix:

| | Guess No | Guess Yes |
|---|---|---|
| Actual No | TN | FP |
| Actual Yes | FN | TP |
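
A worked sketch of the three formulas on made-up labels (1 = had cancer / diagnosed, 0 = not), counting each cell of the matrix by hand.

```python
actual    = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up ground truth
predicted = [1, 0, 0, 0, 1, 0, 0, 1]  # made-up model guesses

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)       # Output: 2 3 1 2
print(accuracy)             # Output: 0.625
print(round(precision, 3))  # Output: 0.667
print(recall)               # Output: 0.5
```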

# KNN:
- Predicts from the k nearest neighbors: their majority class (classification) OR their avg value (regression)
- Scored with accuracy (correct predictions / all) for classification
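
A bare-bones sketch of KNN classification in plain Python, with made-up points and a hypothetical function name `knn_predict`. It uses straight-line (Euclidean) distance and a majority vote.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, query=(2, 2)))  # Output: small
print(knn_predict(points, labels, query=(6, 5)))  # Output: large
```

For regression, the last line of the function would average the neighbors' values instead of voting.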

# Linear Regression:
- Inputs and outputs, y = mx + b, continuous output
- R^2 = how much of y's variance is explained by x
- Root mean squared error = how far predictions are from actual, on avg (in y's units)
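
A small sketch of fitting y = mx + b by least squares and scoring it with R^2 and RMSE. The data is made up and `fit_line` is a hypothetical helper name.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 12]  # made-up, almost a straight line

m, b = fit_line(xs, ys)
predictions = [m * x + b for x in xs]

mean_y = sum(ys) / len(ys)
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained variation
ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total variation
r_squared = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(ys))

print(round(m, 2), round(b, 2))  # Output: 2.2 0.6
print(round(r_squared, 3))       # Output: 0.992
print(round(rmse, 3))            # Output: 0.283
```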

# Logistic Regression:
- Inputs and outputs, BINARY outcomes (predicts the probability of the Yes class)
- Scored with accuracy (bc the outcomes are classes)
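
A sketch of how a logistic model turns an input into a yes/no prediction. The weight and bias are hand-picked for illustration, not fitted to any data, and the input is a made-up "hours studied" value.

```python
import math


def sigmoid(z):
    """Squash any number into the 0 --> 1 range."""
    return 1 / (1 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Probability of the Yes class for a single input x."""
    return sigmoid(weight * x + bias)


# Hypothetical, hand-picked coefficients (not fitted)
weight, bias = 1.5, -6.0

for hours in (2, 4, 6):
    p = predict_proba(hours, weight, bias)
    label = int(p >= 0.5)  # threshold at 0.5 to get the binary outcome
    print(hours, round(p, 3), label)
# Output:
# 2 0.047 0
# 4 0.5 1
# 6 0.953 1
```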