In [None]:
"""
🔹 What is machine learning:

Machine Learning is a subset of Artificial Intelligence (AI) that 
enables systems to learn from data, identify patterns, and make decisions with minimal human intervention.

🔹 How does it work (with diagram):

     +-------------+       +-------------+       +----------------+       +---------------+
     | Input Data  |  -->  | ML Algorithm|  -->  | Trained Model  |  -->  | Predictions   |
     +-------------+       +-------------+       +----------------+       +---------------+
           |                      ↑                     ↓                        |
           |                      +---- Training ----->+                        |
           |                            using labeled/unlabeled data            |
           ↓                                                                   ↓
     Feature Engineering                                                Model Evaluation

Input Data:
Historical data collected from real-world events (sales, images, sensor readings, etc.).

Feature Engineering:
Extracting relevant features (attributes) from raw data.

ML Algorithm:
An algorithm such as Linear Regression, a Decision Tree, or a Neural Network learns a mapping from input features to output labels.

Trained Model:
The result of training: a fitted model that can predict outcomes for unseen data.

Prediction:
New data is input into the model, and it generates predictions (e.g., spam or not spam, house price, etc.).

Model Evaluation:
Accuracy, precision, recall, etc., are used to assess model performance.


🔹 Why it is used:

| Use Case                     | Description                                                                   |
| ---------------------------- | ----------------------------------------------------------------------------- |
| **Automation**               | Automates repetitive tasks without hand-written rules for every case.         |
| **Prediction**               | Forecasts outcomes based on historical patterns (e.g., stock prices, demand). |
| **Pattern Recognition**      | Identifies complex patterns (e.g., fraud detection, facial recognition).      |
| **Personalization**          | Recommends products, movies, music, etc., tailored to user preferences.       |
| **Improved Decision Making** | Supports data-driven decisions in healthcare, finance, logistics.             |


🔹 Advantages:

| Advantage                    | Description                                                                           |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| **Automation of tasks**      | ML can automate and improve processes.                                                |
| **Improves over time**       | Performance usually improves as more representative data becomes available.           |
| **Handles complex problems** | Can solve problems where traditional rules don't work well (e.g., image recognition). |
| **Personalization**          | Customized recommendations and user experiences.                                      |
| **Scalability**              | Can handle huge volumes of data better than manual systems.                           |

🔹 Disadvantages:

| Disadvantage                  | Description                                                   |
| ----------------------------- | ------------------------------------------------------------- |
| **Requires large data**       | Poor performance with limited or poor-quality data.           |
| **Computationally intensive** | Needs high processing power for training models.              |
| **Lack of explainability**    | Many models (like neural networks) are black boxes.           |
| **Overfitting**               | Learns noise instead of pattern; performs poorly on new data. |
| **Bias in data**              | If training data is biased, the model will be biased too.     |


🔹 Process of developing an ML application:

| Step                          | Description                                                             |
| ----------------------------- | ----------------------------------------------------------------------- |
| **1. Problem Definition**     | What do you want the model to predict or classify?                      |
| **2. Data Collection**        | Gather datasets from sensors, logs, databases, APIs, etc.               |
| **3. Data Preprocessing**     | Clean, normalize, handle missing data, and convert to numeric format.   |
| **4. Feature Engineering**    | Select and transform input variables (features) to be used in training. |
| **5. Model Selection**        | Choose an algorithm: Linear Regression, Decision Tree, SVM, etc.        |
| **6. Model Training**         | Fit the model to the data (learn from it).                              |
| **7. Model Evaluation**       | Use test data and metrics like accuracy, precision, recall, F1-score.   |
| **8. Hyperparameter Tuning**  | Adjust model settings to optimize performance.                          |
| **9. Deployment**             | Integrate the model into a live application (web app, mobile app).      |
| **10. Monitoring & Updating** | Keep checking performance and retrain if needed.                        |


🔹 Tabular difference between machine learning and traditional programming:

| Feature                 | Traditional Programming                  | Machine Learning                          |
| ----------------------- | ---------------------------------------- | ----------------------------------------- |
| **Input**               | Data + Rules (manually written)          | Data + Expected Outputs                   |
| **Output**              | Output (result of the logic)             | Model (rules learned automatically)       |
| **Logic Creation**      | Manually coded by humans                 | Learned by algorithms from data           |
| **Adaptability**        | Hard to adapt to new data                | Can be retrained to adapt to new data     |
| **Complexity Handling** | Not suitable for highly complex patterns | Handles complex, non-linear relationships |
| **Error Handling**      | Explicitly defined                       | Inferred from probability and statistics  |
| **Example**             | Calculator (manual rules)                | Spam Filter (learns what spam is)         |
| **Dependency**          | Heavily depends on programmer logic      | Heavily depends on quality of data        |
"""

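The workflow in the diagram above (input data, feature engineering, training, prediction, evaluation) can be sketched in a few lines. This is a minimal illustration and not part of the original notes: the house-size data is synthetic, the variable names are made up, and ordinary least squares via `np.linalg.lstsq` stands in for the "ML Algorithm" box.

```python
import numpy as np

# Hypothetical toy data: house size (square metres) -> price, with random noise.
rng = np.random.default_rng(seed=0)
size = rng.uniform(40, 200, size=200)
price = 1500.0 * size + 20000.0 + rng.normal(0, 5000.0, size=200)

# Feature engineering: build the feature matrix (size plus a constant bias column).
X = np.column_stack([size, np.ones_like(size)])
y = price

# Split into training data and unseen test data (80/20).
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:160], idx[160:]

# Training: least squares learns the mapping from features to target.
weights, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Prediction on unseen data, then evaluation with the R^2 score.
y_pred = X[test_idx] @ weights
ss_res = np.sum((y[test_idx] - y_pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"learned slope={weights[0]:.1f}, intercept={weights[1]:.1f}, test R^2={r2:.3f}")
```

The rule "price is roughly 1500 per square metre" is never written into the code; it is recovered from the data, which is the contrast with traditional programming drawn in the table above.
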
In [None]:
"""
🔹 Variables in machine learning

Variables in ML are attributes, features, or columns of data that influence 
the model’s learning. These are categorized into input (independent) and 
output (dependent) variables.

🔹 Types of variables:

1. Numerical variables: Variables with numeric values. E.g. Salary, Age, Number of purchases
            - Discrete numerical: These variables can only take specific,
                                  distinct, and separate values (usually integers).
                                  They are countable.
                                  E.g. No. of students in a class, No. of transactions, etc.

            - Continuous numerical: These variables can take any real value within a given range,
                                    including decimals. They are measured, not counted.
                                    E.g. Heights of 1000 people (any value between 4 feet and 7 feet is possible)

2. Categorical variables: Variables that represent categories. E.g. Gender (Male or Female), Rating (Low, Medium, High)
            - Ordinal categorical: Categorical variables with a meaningful order.
                                   E.g. Rating (Low, Medium, High)
            - Nominal categorical: Categorical variables without any order.
                                   E.g. Country, Region, etc.
            - Binary categorical:  Categorical variables with only two categories. E.g. Gender (Male or Female)

3. Temporal variables: Time variables represent points in time, time durations, 
                       or ordered timestamps. They are used to track when an event 
                       happened or to model temporal dependencies in data.
                       E.g. 2024-06-21 14:30:00
                       
4. Textual variables: Text variables contain free-form unstructured textual data. 
                      These are common in natural language processing (NLP) and must
                      be transformed into numerical form (vectorized) before being used in ML models.
                      E.g. "This product is great!"

🔹 Variable type conversion

Variable type conversion in ML refers to the process of converting a variable (column/feature) 
from one data type to another, such as:

Categorical → Numerical
Text → Numerical
Integer → Float
String Date → Timestamp etc.

This step is critical because most ML algorithm implementations expect numerical input
(this applies to common tabular models such as decision trees, logistic regression, and SVMs).

"""

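The conversions listed above can be sketched with pandas. This is an illustration under assumptions: the DataFrame, its column names, and the `rating_order` mapping are hypothetical, and text vectorization is left out.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "rating": ["Low", "High", "Medium", "Low"],      # ordinal categorical
    "country": ["India", "USA", "India", "Japan"],   # nominal categorical
    "purchases": [3, 10, 7, 1],                      # discrete numerical
    "signup": ["2024-06-21", "2023-01-05", "2022-11-30", "2024-02-14"],  # date stored as string
})

# Ordinal categorical -> numerical: map categories to integers that keep the order.
rating_order = {"Low": 0, "Medium": 1, "High": 2}
df["rating_code"] = df["rating"].map(rating_order)

# Integer -> float.
df["purchases"] = df["purchases"].astype(float)

# String date -> timestamp.
df["signup"] = pd.to_datetime(df["signup"], format="%Y-%m-%d")

# Nominal categorical -> numerical: one-hot encode, because there is no order to preserve.
df = pd.get_dummies(df, columns=["country"], dtype=int)

print(df.dtypes)
```

Mapping to ordered integers suits ordinal variables only. Applying it to a nominal variable such as country would suggest an order that does not exist, which is why one-hot encoding is used there.
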
# **Data Cleaning**

Data Cleaning (also known as data cleansing or data scrubbing) is the process of detecting, 
correcting, or removing errors and inconsistencies in data to improve its quality and reliability. 
This is a critical step in data preprocessing because raw data is often incomplete, 
inconsistent, inaccurate, or duplicated, and any of these defects can mislead a machine learning model or an analysis.


🔍 Detailed Steps in Data Cleaning

| Step                            | Description                                       | Methods                           |
| ------------------------------- | ------------------------------------------------- | --------------------------------- |
| 1. **Remove duplicates**        | Eliminate repeated rows                           | `df.drop_duplicates()`            |
| 2. **Handle missing values**    | Fill or drop nulls                                | `df.fillna()`, `df.dropna()`      |
| 3. **Correct data types**       | Convert strings to int/float/date etc.            | `df.astype()`, `pd.to_datetime()` |
| 4. **Fix inconsistencies**      | Normalize text values                             | `.str.upper()`: "usa" → "USA"     |
| 5. **Remove outliers**          | Drop or replace extreme values                    | IQR method, Z-score method        |
| 6. **Standardize formats**      | Consistent date/currency formats                  | Use `datetime`, regex             |
| 7. **Spell check / clean text** | Remove typos, unwanted characters                 | NLP techniques                    |
| 8. **Validate ranges**          | Ensure numeric values fall within expected ranges | Logical filters                   |
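
Several of the steps in the table can be chained on a small example. This is a sketch under assumptions: the `raw` DataFrame is made up for illustration, the 1.5 × IQR rule is one common outlier convention among several, and the 0-120 age range is an assumed plausibility check.

```python
import pandas as pd

# Hypothetical raw data with typical defects; names and values are illustrative only.
raw = pd.DataFrame({
    "country": ["USA", "usa", "India", "India", "Japan", "USA"],
    "age": ["25", "31", "47", "47", None, "29"],
    "income": [52000.0, 48000.0, 51000.0, 51000.0, 50000.0, 900000.0],
})

clean = raw.drop_duplicates().copy()                        # step 1: remove repeated rows
clean["age"] = pd.to_numeric(clean["age"])                  # step 3: string -> number
clean["age"] = clean["age"].fillna(clean["age"].median())   # step 2: fill missing values
clean["country"] = clean["country"].str.upper()             # step 4: normalize text values

# Step 5: remove outliers with the IQR rule (keep values within 1.5 * IQR of the quartiles).
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

# Step 8: validate ranges with a logical filter (0-120 is an assumed plausible age range).
clean = clean[clean["age"].between(0, 120)].reset_index(drop=True)

print(clean)
```

The type correction runs before the fill so that the median is computed on numbers and not on strings.
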



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read data from a CSV file.
# The r'...' prefix makes this a raw string, so the backslashes in the Windows
# path are kept literally instead of being treated as escape sequences.
df = pd.read_csv(r'C:\Users\SHREE\Desktop\Data-Science\Data-science\Data.csv')
df.sample(10)

Unnamed: 0.1,Unnamed: 0,brand_name,model_name,os,popularity,best_price,lowest_price,highest_price,sellers_amount,screen_size,memory_size,battery_size,release_date
848,848,OPPO,Reno 4 Lite 8/128GB Magic Blue,Android,991,10221.0,8668.0,12760.0,57,6.43,128.0,4025.0,9-2020
1173,1173,Apple,iPhone 6s 128GB Space Gray (MKQT2),iOS,412,6876.0,6319.0,7990.0,3,4.7,128.0,1715.0,9-2015
334,334,ERGO,F188 Play DS Black,,333,272.0,227.0,289.0,10,1.77,,1000.0,2-2020
536,536,Samsung,Galaxy S9+ SM-G965 DS 128GB Blue,Android,230,13436.0,12592.0,14703.0,9,6.2,128.0,3500.0,2-2018
97,97,OnePlus,6T 8/256GB Midnight Black,Android,661,15625.0,15080.0,16213.0,8,6.41,256.0,3700.0,10-2018
887,887,DOOGEE,S90C 4/128GB Black,Android,476,5612.0,4899.0,6128.0,6,6.18,128.0,5050.0,8-2020
691,691,HUAWEI,P Smart Pro 6/128GB Midnight Black (51094UVB),Android,994,7199.0,6999.0,7399.0,4,6.59,128.0,4000.0,11-2019
1167,1167,Apple,iPhone 6 32GB Space Grey (MQ3D2),iOS,644,6988.0,4550.0,8137.0,5,4.7,32.0,1810.0,3-2017
486,486,Samsung,Galaxy S10+ SM-G975 SS 512GB White,Android,413,18739.0,,,1,6.4,512.0,4000.0,2-2019
915,915,Microsoft,Surface Duo 6GB/256GB (TGM-00001),Android,501,44232.0,42999.0,46879.0,5,8.1,256.0,3577.0,1-2021


In [10]:
df.shape  # returns a tuple: (number of rows, number of columns)

(1224, 13)

In [None]:
# Missing values in the dataset are represented as NaN (Not a Number)

# df.isnull()  # returns a DataFrame of True/False values indicating whether each value is null
df.isnull().sum()  # counts null values per column (use this)
# df.isnull().sum().sum()  # counts null values across the whole dataset

Unnamed: 0          0
brand_name          0
model_name          0
os                197
popularity          0
best_price          0
lowest_price      260
highest_price     260
sellers_amount      0
screen_size         2
memory_size       112
battery_size       10
release_date        0
dtype: int64
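
Raw null counts are easier to judge as a share of the rows, and they lead directly to the drop-or-fill decision from the cleaning table. The sketch below is an assumption-laden illustration: `sample` is a made-up miniature frame that only reuses three column names from the dataset, and the `"Unknown"` placeholder and median fill are example strategies, not the notebook's chosen method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame reusing three column names from the dataset above;
# the values are made up for illustration.
sample = pd.DataFrame({
    "os": ["Android", None, "iOS", "Android", None],
    "lowest_price": [8668.0, 227.0, np.nan, 12592.0, 6319.0],
    "memory_size": [128.0, np.nan, 128.0, 256.0, 32.0],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: fill instead of dropping (placeholder label for text, median for numbers).
filled = sample.copy()
filled["os"] = filled["os"].fillna("Unknown")
for col in ["lowest_price", "memory_size"]:
    filled[col] = filled[col].fillna(filled[col].median())

print(len(sample), len(dropped), int(filled.isnull().sum().sum()))
```

Dropping is simple but discards whole rows, so it costs more data as the share of missing values grows. Filling keeps every row at the price of introducing estimated values.
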