# Pandas 
Pandas is a powerful data manipulation library in Python and is widely used in machine learning workflows for data preparation and analysis. Here are some important methods in Pandas that are commonly used in machine learning:

1. **Data Loading:**
   - `pd.read_csv`, `pd.read_excel`, `pd.read_sql`: Load data from various file formats and databases.

2. **Data Exploration:**
   - `df.head()`, `df.tail()`: Display the first or last few rows of the DataFrame.
   - `df.info()`, `df.describe()`: Display information about the DataFrame, including data types and summary statistics.
   - `df.shape`: Get the number of rows and columns in the DataFrame.
   - `df.columns`: Get the column names.

3. **Data Cleaning:**
   - `df.isnull()`, `df.isnull().sum()`: Check for missing values in the DataFrame.
   - `df.dropna()`, `df.fillna()`: Drop or fill missing values.
   - `df.drop_duplicates()`: Remove duplicate rows.
   - `df.replace()`: Replace values in the DataFrame.

4. **Indexing and Selection:**
   - `df.iloc[]`, `df.loc[]`: Select data by integer position (`iloc`) or by label (`loc`).
   - `df['column_name']`, `df[['col1', 'col2']]`: Select columns.
   - `df.query()`: Perform a query on the DataFrame.

5. **Data Transformation:**
   - `df.apply()`, `df.applymap()`: Apply a function along rows or columns (`apply`), or element-wise to the entire DataFrame (`applymap`, deprecated since pandas 2.1 in favor of `df.map()`).
   - `pd.get_dummies()`: One-hot encode categorical variables.
   - `df.astype()`: Convert the data type of a column.
   - `df.join()`, `df.merge()`: Join or merge DataFrames.

6. **Grouping and Aggregation:**
   - `df.groupby()`: Group data based on a column.
   - `grouped_df.aggregate()`, `grouped_df.mean()`, `grouped_df.sum()`: Perform aggregate functions on groups.

7. **Feature Engineering:**
   - Creating new columns based on existing ones.
   - Binning or discretizing continuous variables.
   - Extracting information from text or datetime columns.

8. **Handling Text Data:**
   - `Series.str.contains()`: Check whether each string in a column contains a substring or regex pattern.
   - `Series.str.extract()`: Extract regex capture groups from each string.
   - `Series.str.replace()`: Replace substrings in string columns.

9. **Date and Time Handling:**
   - `pd.to_datetime()`: Convert a column to datetime format.
   - Extracting components of the date (year, month, day).

10. **Data Visualization:**
    - `df.plot()`: Create basic plots directly from the DataFrame.
    - `sns.pairplot()`, `sns.heatmap()`: Visualization using Seaborn or other plotting libraries.
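
As a rough illustration of how several of the methods above fit together, here is a minimal sketch on a small made-up DataFrame. The column names (`city`, `income`, `signup`) and all values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", None],
    "income": [50000.0, None, 42000.0, 61000.0],
    "signup": ["2023-01-15", "2023-02-20", "2023-02-25", "2023-03-05"],
})

# Data cleaning: count missing values per column, then fill them.
missing_per_column = df.isnull().sum()
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("Unknown")

# Date handling: parse the strings and extract the month as a new feature.
df["signup"] = pd.to_datetime(df["signup"])
df["signup_month"] = df["signup"].dt.month

# Feature engineering: bin a continuous variable into two labelled ranges.
df["income_band"] = pd.cut(df["income"], bins=[0, 50000, 100000], labels=["low", "high"])

# Grouping and aggregation: mean income per city.
mean_income = df.groupby("city")["income"].mean()

# Data transformation: one-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```

Filling with the median and the bin edges used here are arbitrary choices for the sketch; the right strategy depends on the dataset.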



In [19]:
import pandas as pd

### Create Python data and load the data using pandas

In [20]:
data = {"name":["som","sam","sanjay","Ramesh"],
"Mark":[99,76,100,45]}
print(data)

{'name': ['som', 'sam', 'sanjay', 'Ramesh'], 'Mark': [99, 76, 100, 45]}


In [21]:
db_py = pd.DataFrame(data)
print(db_py)

     name  Mark
0     som    99
1     sam    76
2  sanjay   100
3  Ramesh    45


In [22]:
db = pd.read_csv("data/data.csv")
print(db.to_string())

      Loan_ID  Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status
0    LP001002    Male      No          0      Graduate            No             5849           0.000000         NaN             360.0             1.0         Urban           Y
1    LP001003    Male     Yes          1      Graduate            No             4583        1508.000000       128.0             360.0             1.0         Rural           N
2    LP001005    Male     Yes          0      Graduate           Yes             3000           0.000000        66.0             360.0             1.0         Urban           Y
3    LP001006    Male     Yes          0  Not Graduate            No             2583        2358.000000       120.0             360.0             1.0         Urban           Y
4    LP001008    Male      No          0      Graduate            No             6000           0.000000       141.

### Find number of Rows and Columns

In [23]:
db.axes

[RangeIndex(start=0, stop=614, step=1),
 Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
        'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
        'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
       dtype='object')]
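
`db.axes` lists the row index and the column labels. When only the counts are needed, `df.shape` returns them directly as a `(rows, columns)` tuple. A small sketch on a hypothetical toy frame (not the loan dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for a loaded dataset.
toy = pd.DataFrame({"name": ["som", "sam", "sanjay"], "Mark": [99, 76, 100]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
```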

In [24]:
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [25]:
db.describe()

       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count            614.0              614.0       592.0             600.0           564.0
mean       5403.459283        1621.245798  146.412162             342.0        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min              150.0                0.0         9.0              12.0             0.0
25%             2877.5                0.0       100.0             360.0             1.0
50%             3812.5             1188.5       128.0             360.0             1.0
75%             5795.0            2297.25       168.0             360.0             1.0
max            81000.0            41667.0       700.0             480.0             1.0


In [26]:
db_loan_id = db['Loan_ID']
print(db_loan_id.head(10))
print(db_loan_id.tail(10))

0    LP001002
1    LP001003
2    LP001005
3    LP001006
4    LP001008
5    LP001011
6    LP001013
7    LP001014
8    LP001018
9    LP001020
Name: Loan_ID, dtype: object
604    LP002959
605    LP002960
606    LP002961
607    LP002964
608    LP002974
609    LP002978
610    LP002979
611    LP002983
612    LP002984
613    LP002990
Name: Loan_ID, dtype: object


### Display all rows containing null values

In [27]:
nulldata = db.isnull().any(axis=1)
print(db[nulldata].to_string())

      Loan_ID  Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status
0    LP001002    Male      No          0      Graduate            No             5849                0.0         NaN             360.0             1.0         Urban           Y
11   LP001027    Male     Yes          2      Graduate           NaN             2500             1840.0       109.0             360.0             1.0         Urban           Y
16   LP001034    Male      No          1  Not Graduate            No             3596                0.0       100.0             240.0             NaN         Urban           Y
19   LP001041    Male     Yes          0      Graduate           NaN             2600             3500.0       115.0               NaN             1.0         Urban           Y
23   LP001050     NaN     Yes          2  Not Graduate            No             3365             1917.0       112.

## Cleaning Data
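- Note: `db.dropna()` returns a new DataFrame and leaves `db` unchanged, which is why the result is assigned to `newdb` below. With `dropna(inplace=True)`, no new DataFrame is returned; the rows containing NULL values are removed from the original DataFrame instead.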
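
The difference between the two call styles can be checked on a small frame. This is a minimal sketch with hypothetical data, not the loan dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Default call: returns a new DataFrame; the original keeps all 3 rows.
cleaned = df.dropna()

# inplace=True: modifies the DataFrame itself and returns None.
df_copy = df.copy()
result = df_copy.dropna(inplace=True)

print(len(df), len(cleaned), len(df_copy), result)
```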


In [28]:
newdb = db.dropna()
print(newdb.to_string())

      Loan_ID  Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status
1    LP001003    Male     Yes          1      Graduate            No             4583        1508.000000       128.0             360.0             1.0         Rural           N
2    LP001005    Male     Yes          0      Graduate           Yes             3000           0.000000        66.0             360.0             1.0         Urban           Y
3    LP001006    Male     Yes          0  Not Graduate            No             2583        2358.000000       120.0             360.0             1.0         Urban           Y
4    LP001008    Male      No          0      Graduate            No             6000           0.000000       141.0             360.0             1.0         Urban           Y
5    LP001011    Male     Yes          2      Graduate           Yes             5417        4196.000000       267.

In [29]:
db.drop_duplicates()

      Loan_ID  Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History Property_Area Loan_Status
0    LP001002    Male      No          0      Graduate            No             5849                0.0         NaN             360.0             1.0         Urban           Y
1    LP001003    Male     Yes          1      Graduate            No             4583             1508.0       128.0             360.0             1.0         Rural           N
2    LP001005    Male     Yes          0      Graduate           Yes             3000                0.0        66.0             360.0             1.0         Urban           Y
3    LP001006    Male     Yes          0  Not Graduate            No             2583             2358.0       120.0             360.0             1.0         Urban           Y
4    LP001008    Male      No          0      Graduate            No             6000                0.0       141.0             360.0             1.0         Urban           Y
..        ...     ...     ...        ...           ...           ...              ...                ...         ...               ...             ...           ...         ...
609  LP002978  Female      No          0      Graduate            No             2900                0.0        71.0             360.0             1.0         Rural           Y
610  LP002979    Male     Yes         3+      Graduate            No             4106                0.0        40.0             180.0             1.0         Rural           Y
611  LP002983    Male     Yes          1      Graduate            No             8072              240.0       253.0             360.0             1.0         Urban           Y
612  LP002984    Male     Yes          2      Graduate            No             7583                0.0       187.0             360.0             1.0         Urban           Y
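
The effect of `drop_duplicates()` is easier to see on data that actually contains repeats. The following sketch uses a hypothetical toy frame; the `subset` argument restricts the comparison to the listed columns.

```python
import pandas as pd

# Hypothetical toy frame: row 1 is an exact repeat of row 0.
toy = pd.DataFrame({"id": [1, 1, 2, 2], "value": ["a", "a", "b", "c"]})

n_dupes = toy.duplicated().sum()           # number of fully repeated rows
deduped = toy.drop_duplicates()            # drops the exact repeat only
by_id = toy.drop_duplicates(subset="id")   # keeps the first row for each id

print(n_dupes, len(deduped), len(by_id))
```

Like `dropna()`, `drop_duplicates()` returns a new DataFrame by default, so the result has to be assigned to keep it.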


# Scikit-learn
Scikit-learn (sklearn) is a comprehensive machine learning library in Python. Here are some important methods and functions in scikit-learn commonly used in machine learning workflows:

1. **Model Selection:**
   - `train_test_split`: Split datasets into training and testing sets.
   - `cross_val_score`: Evaluate a model using cross-validation.
   - `GridSearchCV`: Perform hyperparameter tuning using grid search.

2. **Preprocessing:**
   - `StandardScaler`: Standardize features by removing the mean and scaling to unit variance.
   - `MinMaxScaler`: Scale features to a specified range.
   - `OneHotEncoder`: Encode categorical features (strings or integers) as one-hot vectors.
   - `LabelEncoder`: Encode target labels with values between 0 and n_classes-1.

3. **Supervised Learning Models:**
   - `LinearRegression`: Linear regression model.
   - `LogisticRegression`: Logistic regression for classification tasks.
   - `DecisionTreeClassifier` and `DecisionTreeRegressor`: Decision tree models.
   - `RandomForestClassifier` and `RandomForestRegressor`: Random Forest models.
   - `SVC` and `SVR` (in `sklearn.svm`): Support Vector Machines for classification and regression.
   - `KNeighborsClassifier` and `KNeighborsRegressor`: k-Nearest Neighbors models.

4. **Unsupervised Learning Models:**
   - `KMeans`: K-Means clustering algorithm.
   - `PCA`: Principal Component Analysis for dimensionality reduction.
   - `DBSCAN`: Density-Based Spatial Clustering of Applications with Noise.

5. **Metrics and Evaluation:**
   - `accuracy_score`, `precision_score`, `recall_score`, `f1_score`: Evaluation metrics for classification.
   - `mean_squared_error`, `mean_absolute_error`: Evaluation metrics for regression.
   - `confusion_matrix`: Compute confusion matrix for classification tasks.

6. **Ensemble Methods:**
   - `VotingClassifier`: Combine multiple classifiers to improve performance.
   - `BaggingClassifier` and `BaggingRegressor`: Bootstrap Aggregating (Bagging) methods.
   - `AdaBoostClassifier` and `AdaBoostRegressor`: Adaptive Boosting (AdaBoost) methods.

7. **Model Persistence:**
   - `joblib.dump` and `joblib.load`: Save and load fitted models using joblib.

8. **Pipeline:**
   - `Pipeline`: Construct a pipeline of transformers and an estimator.

9. **Feature Selection:**
   - `SelectKBest`, `SelectPercentile`: Select top k or top percentile features based on statistical tests.

10. **Text and Image Processing:**
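    - `CountVectorizer` and `TfidfVectorizer`: Convert text data to a bag-of-words representation, using raw token counts or TF-IDF weights respectively.
    - `HashingVectorizer`: Convert text data to a fixed-size feature vector using the hashing trick.
    - `extract_patches_2d` and `PatchExtractor` (in `sklearn.feature_extraction.image`): Extract patches from images for use as features.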

These are just a few examples of the many functionalities provided by scikit-learn. The library is well-documented, and you can find detailed information and examples in the official documentation: [Scikit-learn Documentation](https://scikit-learn.org/stable/documentation.html).
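
To show how a few of these pieces combine, here is a minimal sketch that chains a scaler and a classifier in a `Pipeline`, then evaluates it with a hold-out split and cross-validation. The Iris dataset (bundled with scikit-learn) and the hyperparameters are stand-ins chosen for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Iris ships with scikit-learn; it is used here only as a stand-in dataset.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Inside the pipeline, the scaler is fitted on the training data only.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"test accuracy: {test_accuracy:.3f}, mean CV accuracy: {cv_scores.mean():.3f}")
```

Wrapping preprocessing in the pipeline keeps test-set statistics out of the fitted scaler, which is the main reason to prefer it over scaling the full dataset up front.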

Normalization and standardization are two common preprocessing techniques used in machine learning to scale and transform feature values. These techniques are applied to ensure that input features have similar scales, which can be important for certain machine learning algorithms.

1. **Normalization:**
   - **Method:** Scaling the values of a feature to a specific range, often [0, 1].
   - **Formula:** \(X_{\text{normalized}} = \frac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}\)
   - **Scikit-learn class:** `MinMaxScaler` (default `feature_range=(0, 1)`).

2. **Standardization:**
   - **Method:** Transforming the values of a feature to have a mean of 0 and a standard deviation of 1.
   - **Formula:** \(X_{\text{standardized}} = \frac{X - \text{mean}(X)}{\text{std}(X)}\)
   - **Scikit-learn class:** `StandardScaler`.
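
Both transformations are available in scikit-learn as `MinMaxScaler` and `StandardScaler`. The sketch below applies them to a hypothetical single-feature array and compares the results with the two formulas above computed directly in NumPy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with values on an arbitrary scale.
X = np.array([[10.0], [20.0], [30.0], [40.0]])

X_norm = MinMaxScaler().fit_transform(X)    # (X - min) / (max - min)
X_std = StandardScaler().fit_transform(X)   # (X - mean) / std, population std (ddof=0)

# The same results computed directly from the formulas.
manual_norm = (X - X.min()) / (X.max() - X.min())
manual_std = (X - X.mean()) / X.std()

print(np.allclose(X_norm, manual_norm), np.allclose(X_std, manual_std))
```

In practice the scaler is usually fitted on the training split only and then reused via `transform` on the test split.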


# SciPy
SciPy is a scientific computing library in Python that builds on NumPy and provides additional functionality for scientific and technical computing. While scikit-learn is often the primary library for machine learning tasks, SciPy can complement it with various scientific and statistical functions. Here are some important modules in SciPy that are relevant to machine learning:

1. **Statistical Functions:**
   - `scipy.stats`: Provides a wide range of statistical functions, probability distributions, and statistical tests. Functions like `ttest_ind`, `pearsonr`, and `chi2_contingency` are commonly used in hypothesis testing and statistical analysis.

2. **Optimization:**
   - `scipy.optimize`: Contains functions for optimization problems. `minimize` is a versatile optimization routine that can be used for parameter tuning in machine learning models.

3. **Sparse Matrix Operations:**
   - `scipy.sparse`: Provides sparse matrix functionality, which is useful for handling large datasets with many zero values efficiently. Sparse matrices are commonly used in certain machine learning algorithms.

4. **Linear Algebra:**
   - `scipy.linalg`: Extends NumPy's linear algebra capabilities, providing additional functionality such as solving linear systems, eigenvalue problems, and singular value decomposition.

5. **Signal Processing:**
   - `scipy.signal`: Offers functions for signal processing, including filtering, spectral analysis, and waveform generation. Useful for preprocessing tasks in signal-related machine learning problems.

6. **Image Processing:**
   - `scipy.ndimage`: Contains functions for image processing, including image filtering, morphology, and measurements. Useful for preprocessing in computer vision and image-based machine learning tasks.

7. **Sparse Eigenvalue Problems:**
   - `scipy.sparse.linalg`: Provides functions for solving sparse eigenvalue problems, which can be relevant in certain machine learning applications.

8. **Integration:**
   - `scipy.integrate`: Offers functions for numerical integration and solving ordinary differential equations. Useful in certain types of simulations and dynamic systems modeling.

9. **Interpolation:**
   - `scipy.interpolate`: Provides functions for interpolating data, which can be useful for smoothing or resampling datasets in machine learning.

10. **Distance Metrics:**
    - `scipy.spatial.distance`: Includes functions for calculating distances between points, useful for clustering and nearest neighbors algorithms.

While scikit-learn is more focused on machine learning algorithms and model evaluation, SciPy provides a broader set of tools for scientific computing, making it a valuable companion for various tasks related to machine learning.
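
A short sketch touching four of the modules listed above: `scipy.optimize`, `scipy.stats`, `scipy.sparse`, and `scipy.spatial.distance`. The loss function, sample sizes, and points are hypothetical and chosen only so the results are easy to check.

```python
import numpy as np
from scipy import optimize, sparse, stats
from scipy.spatial import distance

# Optimization: minimize a simple quadratic with a known minimum at (1, -2).
def loss(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

result = optimize.minimize(loss, x0=np.zeros(2))

# Statistics: two-sample t-test on seeded synthetic samples with different means.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Sparse matrices: only the non-zero entries are stored.
dense = np.array([[0, 0, 3], [0, 0, 0], [4, 0, 0]])
sp = sparse.csr_matrix(dense)

# Distances: pairwise Euclidean distances between two sets of points.
d = distance.cdist([[0.0, 0.0]], [[3.0, 4.0], [0.0, 1.0]])

print(result.x, p_value, sp.nnz, d)
```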

# NumPy
NumPy is a fundamental library for numerical computing in Python and is extensively used in machine learning for numerical operations and array manipulations. Here are some important methods and functions in NumPy that are commonly used in machine learning:

1. **Array Creation:**
   - `numpy.array()`: Create an array from a list or tuple.
   - `numpy.zeros()`, `numpy.ones()`: Create arrays filled with zeros or ones.
   - `numpy.arange()`, `numpy.linspace()`: Generate arrays with evenly spaced values.

2. **Array Operations:**
   - Arithmetic operations (`+, -, *, /, **`): Perform element-wise arithmetic on arrays.
   - `numpy.dot()`: Perform matrix multiplication.
   - `numpy.transpose()`, `array.T`: Transpose an array.

3. **Array Indexing and Slicing:**
   - Indexing (`array[0]`): Access elements at a specific index.
   - Slicing (`array[start:stop:step]`): Extract a portion of the array.
   - Fancy indexing: Use arrays as indices.

4. **Array Shape Manipulation:**
   - `numpy.reshape()`: Reshape an array.
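   - `ndarray.flatten()`, `numpy.ravel()`: Flatten arrays. `flatten()` is an array method that always returns a copy, while `ravel()` returns a view when possible.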
   - `numpy.concatenate()`, `numpy.vstack()`, `numpy.hstack()`: Combine arrays.

5. **Mathematical Functions:**
   - `numpy.sum()`, `numpy.mean()`, `numpy.median()`: Calculate sum, mean, and median.
   - `numpy.std()`, `numpy.var()`: Calculate standard deviation and variance.
   - `numpy.min()`, `numpy.max()`: Find minimum and maximum values.
   - `numpy.abs()`: Compute the absolute values.

6. **Linear Algebra:**
   - `numpy.linalg.inv()`: Compute the inverse of a matrix.
   - `numpy.linalg.det()`: Compute the determinant of a matrix.
   - `numpy.linalg.eig()`: Compute the eigenvalues and eigenvectors of a square matrix.

7. **Random Number Generation:**
   - `numpy.random.rand()`: Generate random samples from a uniform distribution.
   - `numpy.random.randn()`: Generate random samples from a standard normal distribution.
   - `numpy.random.randint()`: Generate random integers.
   - `numpy.random.shuffle()`: Shuffle the elements of an array.

8. **Array Comparison and Boolean Indexing:**
   - Comparison operators (`<, >, <=, >=, ==, !=`): Perform element-wise comparisons.
   - Boolean indexing: Use boolean arrays to index arrays selectively.

9. **Broadcasting:**
   - Implicit element-wise operations between arrays of different but compatible shapes, where dimensions of size 1 are stretched to match.

10. **File I/O:**
    - `numpy.save()`, `numpy.load()`: Save and load arrays to/from disk.

These are just a few examples of the many functionalities provided by NumPy. NumPy is a crucial library for numerical computations and forms the backbone of many machine learning libraries, including scikit-learn. Familiarity with NumPy is essential for working efficiently with numerical data in the context of machine learning.
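
A few of the operations above on a small array, as a minimal sketch. The seeded `numpy.random.default_rng()` generator is used here as an assumption about preferred style; it is NumPy's newer alternative to the `numpy.random.rand()` family listed earlier.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the (3,) array of column means is stretched across both rows.
centered = a - a.mean(axis=0)

# Boolean indexing: keep only the elements that satisfy a condition.
evens = a[a % 2 == 0]

# flatten() always copies; ravel() returns a view when possible.
flat_copy = a.flatten()
flat_view = np.ravel(a)

# Matrix product of a (2, 3) array with its (3, 2) transpose.
gram = a @ a.T

# Seeded Generator for reproducible random numbers.
rng = np.random.default_rng(42)
samples = rng.standard_normal(5)

print(centered, evens, gram, sep="\n")
```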