
In [24]:
!git clone https://github.com/ManoharKonala/BootCamp_2K24.git

Cloning into 'BootCamp_2K24'...
remote: Enumerating objects: 255, done.
remote: Counting objects: 100% (146/146), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 255 (delta 87), reused 56 (delta 25), pack-reused 109
Receiving objects: 100% (255/255), 331.77 KiB | 7.06 MiB/s, done.
Resolving deltas: 100% (123/123), done.


# **What is Data Manipulation?**
Data manipulation involves transforming, cleaning, and restructuring data to make it usable for analysis and modeling. This process includes a variety of tasks such as:

* **Cleaning:** Removing or correcting errors, inconsistencies, and missing values in the data.
* **Transforming:** Changing the format or structure of the data to make it more suitable for analysis.
* **Reshaping:** Adjusting the data layout, for example, pivoting or unpivoting data tables.
* **Aggregating:** Summarizing data by grouping and applying aggregate functions like sum, mean, or count.
* **Merging:** Combining multiple datasets into a single dataset.
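
The five tasks above can be sketched in a few lines of pandas, the library this notebook introduces. The tables and column names (`orders`, `customers`, `amount`, `month`, `city`) are invented for illustration; treat this as a minimal sketch rather than a recommended workflow.

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "amount": [100.0, None, 250.0, 80.0, 80.0],
    "month": ["Jan", "Jan", "Feb", "Jan", "Jan"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Cleaning: drop exact duplicate rows, then rows with a missing amount.
clean = orders.drop_duplicates().dropna(subset=["amount"])

# Transforming: add a column derived from an existing one.
clean = clean.assign(amount_with_tax=clean["amount"] * 1.1)

# Reshaping: one row per customer, one column per month.
wide = clean.pivot_table(index="customer_id", columns="month",
                         values="amount", aggfunc="sum")

# Aggregating: total amount per customer.
totals = clean.groupby("customer_id", as_index=False)["amount"].sum()

# Merging: attach each customer's city to the totals.
merged = totals.merge(customers, on="customer_id", how="left")
print(merged)
```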


# **Importance of Data Manipulation in Data Science and Analysis**
**1. Data Quality**
* Raw Data Issues: Raw data often contains errors, inconsistencies, and missing values. These issues can arise from data entry mistakes, sensor errors, or incomplete data collection.
* Cleaning and Validation: Data manipulation allows you to identify and correct these issues, ensuring the data is accurate and reliable for analysis.

**2. Data Reshaping**
* Inappropriate Formats: Data might not always be in the right format for analysis or modeling. For instance, you might have data spread across multiple tables or in a format that is not suitable for the analysis you want to perform.
* Reshaping for Analysis: Data manipulation helps you transform and restructure the data into a format that is more suitable for specific analytical tasks, such as merging datasets, creating pivot tables, or reshaping data from wide to long format.
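
As a small illustration of the wide-to-long case mentioned above, the sketch below uses `DataFrame.melt`. The table and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "2022": [10, 20],
    "2023": [15, 25],
})

# melt() turns the year columns into (year, sales) rows.
long = wide.melt(id_vars="city", var_name="year", value_name="sales")
print(long)
```

Each city now appears once per year, which is the layout most grouping and plotting functions expect.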

**3. Feature Engineering**
* Creating New Features: Feature engineering involves creating new features or variables from existing data to improve model performance in machine learning.
* Enhancing Model Accuracy: Data manipulation is essential for generating these new features, which can capture additional patterns and relationships in the data that the original features might miss.
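
A minimal sketch of deriving new features from existing columns. The columns (`height_m`, `weight_kg`, `signup`) and the derived features are assumptions chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical raw data.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 90.0],
    "signup": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-07-21"]),
})

# Numeric feature computed from two existing columns.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Calendar feature extracted from a datetime column.
df["signup_month"] = df["signup"].dt.month
print(df)
```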

**4. Exploration and Visualization**
* Preparation for EDA: Exploratory Data Analysis (EDA) involves visualizing and summarizing the main characteristics of a dataset.
* Visualizing Patterns: By manipulating the data, you can create visualizations and summary statistics that help you understand patterns, trends, and relationships in the data, making it easier to generate insights and hypotheses.
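
The summary-statistics side of EDA might look like the sketch below. The `scores` table is invented for the example.

```python
import pandas as pd

# Hypothetical scores for two groups.
scores = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [70, 90, 60, 75, 90],
})

# describe() reports count, mean, std, min, quartiles, and max.
summary = scores["score"].describe()

# Mean score per group.
group_means = scores.groupby("group")["score"].mean()

print(summary)
print(group_means)
```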

# **Pandas: A Library for Handling Structured Data in Python**

- **Powerful and Versatile:** Specifically designed for data manipulation and analysis.
- **Efficient Data Structures:** Handles tabular data (like Excel or SQL tables) and time series data.
- **Core Components:**
  - **DataFrame:** 2D labeled data structure, similar to a table.
  - **Series:** 1D labeled array, capable of holding any data type.
- **Wide Range of Functionalities:**
  - Data cleaning
  - Data transformation
  - Merging and joining datasets
  - Handling missing data
  - Performing group operations
- **Seamless Integration:** Works well with other popular Python libraries like NumPy, SciPy, and matplotlib.
- **Essential Tool:** Indispensable for data scientists and analysts for processing, analyzing, and visualizing structured data.
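
To make the two core components concrete, here is a short sketch showing how they relate: selecting a single column of a DataFrame gives back a Series. The names and ages are made up.

```python
import pandas as pd

# A DataFrame is a 2D table with labeled columns.
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 27]})

# Selecting one column returns a 1D Series that keeps the row labels.
age = df["age"]

print(type(df).__name__)   # DataFrame
print(type(age).__name__)  # Series
```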



# Benefits of Using Pandas for Data Cleaning and Analysis

- **Ease of Use:**
  - Simple and intuitive API.
  - Easy to load, manipulate, and analyze data with DataFrames and Series.

- **Flexibility:**
  - Wide range of functionalities for various data manipulation needs.
  - Suitable for data cleaning, EDA, and building machine learning models.

- **Handling Large Datasets:**
  - Handles datasets that fit in memory efficiently, using vectorized operations built on NumPy.
  - Compact dtypes (for example, categoricals) can reduce memory usage and speed up processing.

- **Comprehensive Data Cleaning:**
  - Handles missing values, removes duplicates, and standardizes data formats.
  - Ensures data quality and prepares raw data for analysis.

- **Advanced Data Analysis:**
  - Performs complex operations like group-by, rolling statistics, and multi-level indexing.
  - Enables in-depth data analysis and extraction of valuable insights.

- **Integration with Other Libraries:**
  - Works well with NumPy, SciPy, and matplotlib.
  - Enhances capabilities and allows for seamless workflows in data science projects.

- **Rich Functionality for Time Series Analysis:**
  - Tools for date range generation, frequency conversion, and time series-specific operations.
  - Ideal for financial data analysis and other time-stamped data applications.

- **Built-in Data Visualization:**
  - Quick creation of basic visualizations directly from DataFrames and Series.
  - Complements specialized visualization libraries like matplotlib and seaborn.
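
As one example of the time series tools listed above, the sketch below generates a date range, converts the frequency, and computes a rolling statistic. The readings are invented values.

```python
import pandas as pd

# Six daily readings, invented for the example.
days = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# Frequency conversion: sum the readings in 2-day bins.
two_day_totals = readings.resample("2D").sum()

# Rolling statistic: mean over a window of 3 observations.
# The first two positions are NaN because the window is not yet full.
rolling_mean = readings.rolling(window=3).mean()

print(two_day_totals)
print(rolling_mean)
```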

# Installing Pandas Locally
* You can install Pandas using pip from the command line.

**Command to install Pandas**
* pip install pandas

**Command to install Pandas with Python 3**
* pip3 install pandas

# Installing Pandas in a Virtual Environment
**Create a new virtual environment (e.g., named 'myenv')**
* python -m venv myenv

**Activate the virtual environment on Windows**
* myenv\Scripts\activate

**Activate the virtual environment on macOS/Linux**
* source myenv/bin/activate

**Install Pandas inside the activated environment**
* pip install pandas


# Install Pandas within a Jupyter notebook
!pip install pandas
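
Whichever installation route you use, a quick way to confirm it worked is to import the library and print its version. The version string you see depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is installed in the active environment.
print(pd.__version__)
```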

# PANDAS SERIES
* A Pandas Series is a one-dimensional labeled array capable of holding any data type. It is similar to a column in a DataFrame or an array in NumPy but comes with additional features like labeling and built-in methods for data manipulation.

**From List**

In [4]:
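import pandas as pd

# Build a Series from a list; pandas assigns a default integer index (0, 1, 2, ...)
my_list = [120, 240, 340, 460, 560]
my_series = pd.Series(my_list)
print(my_series, "\n")

# Same data with an explicit index of labels
labeled_series = pd.Series(my_list, index=["a", "b", "c", "d", "e"])
print(labeled_series)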


0    120
1    240
2    340
3    460
4    560
dtype: int64 

a    120
b    240
c    340
d    460
e    560
dtype: int64


**From Arrays**

In [8]:
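import numpy as np

# Build a Series from a NumPy array; the dtype is taken from the array
arr = np.array([120, 240, 340, 460, 560])
my_series = pd.Series(arr)
print(my_series, "\n")

# Same data with an explicit index of labels
labeled_series = pd.Series(arr, index=["a", "b", "c", "d", "e"])
print(labeled_series)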

0    120
1    240
2    340
3    460
4    560
dtype: int64 

a    120
b    240
c    340
d    460
e    560
dtype: int64


**From Dictionary**

In [13]:
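# Named my_dict so the built-in dict type is not shadowed
my_dict = {"a": 120, "b": 240, "c": 340, "d": 460, "e": 560}

# The dictionary keys become the index labels
my_series = pd.Series(my_dict)
print(my_series, "\n")

# With an explicit index: values are looked up by key, in the order given
labeled_series = pd.Series(my_dict, index=["a", "b", "c", "d", "e"])
print(labeled_series)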

a    120
b    240
c    340
d    460
e    560
dtype: int64 

a    120
b    240
c    340
d    460
e    560
dtype: int64
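
In the dictionary example above the explicit index matches the keys exactly, so both outputs are identical. The sketch below, with made-up keys, shows what happens when they differ: pandas selects values by key, follows the order of the index, and fills labels that are missing from the dictionary with NaN.

```python
import pandas as pd

prices = {"a": 120, "b": 240, "c": 340}

# "z" is not a key in the dictionary, and "b" is not requested.
s = pd.Series(prices, index=["c", "a", "z"])
print(s)
# c    340.0
# a    120.0
# z      NaN
# dtype: float64
```

The dtype becomes float64 because NaN cannot be stored in an int64 Series.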


# Exploring Series Attributes
**Common Attributes**
* s.index: Returns the index labels of the Series.
* s.values: Returns the underlying values in the Series as a NumPy array.
* s.dtype: Provides the data type of the Series elements.
* s.size: Returns the number of elements in the Series.

In [15]:
my_list = [120, 240, 340, 460, 560]
my_series = pd.Series(my_list)
print(my_series.index)
print(my_series.values)
print(my_series.dtype)
print(my_series.size)


RangeIndex(start=0, stop=5, step=1)
[120 240 340 460 560]
int64
5


**Common Methods**

Pandas Series methods allow you to perform various operations, including arithmetic, data manipulation, and indexing.

* s.head(n): Returns the first n elements of the Series.
* s.tail(n): Returns the last n elements.
* s.sort_values(): Sorts the Series by its values.
* s.mean(), s.median(), s.std(): Compute common statistics.
* s.str: Accessor that exposes string methods such as s.str.upper() when the Series contains string data.
* s.apply(func): Applies a function to each element in the Series.

In [23]:
# Numeric Series used for the numeric methods
my_list = [120, 2400, 340, 40, 0]
my_series = pd.Series(my_list)

# String Series used for the .str accessor
another_list = ["raju", "ante", "rebel", "aa", "ra"]
string_series = pd.Series(another_list)

print("First two elements:\n", my_series.head(2), "\n")
print("Last two elements:\n", my_series.tail(2), "\n")
print("Sorted Series:\n", my_series.sort_values(), "\n")
print("Mean:\n", my_series.mean(), "\n")
print("Median:\n", my_series.median())
print("Standard Deviation:\n", my_series.std(), "\n")
print("String Methods:\n", string_series.str.upper(), "\n")
print("Apply Method:\n", my_series.apply(lambda x: x**2))

First two elements:
 0     120
1    2400
dtype: int64 

Last two elements:
 3    40
4     0
dtype: int64 

Sorted Series:
 4       0
3      40
0     120
2     340
1    2400
dtype: int64 

Mean:
 580.0 

Median:
 120.0
Standard Deviation:
 1025.865488258573 

String Methods:
 0     RAJU
1     ANTE
2    REBEL
3       AA
4       RA
dtype: object 

Apply Method:
 0      14400
1    5760000
2     115600
3       1600
4          0
dtype: int64
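
One detail behind the standard deviation printed above: `Series.std()` uses the sample formula (`ddof=1`) by default, while NumPy's `np.std` defaults to the population formula (`ddof=0`). The sketch below reuses the same five numbers to show the difference.

```python
import numpy as np
import pandas as pd

s = pd.Series([120, 2400, 340, 40, 0])

sample_std = s.std()               # pandas default, ddof=1
population_std = s.std(ddof=0)     # population formula
numpy_std = np.std(s.to_numpy())   # NumPy default, ddof=0

print(sample_std)
print(population_std)
print(numpy_std)
```

Pass `ddof` explicitly when comparing results between the two libraries.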
