<a href="https://colab.research.google.com/github/MonkeyWrenchGang/PythonBootcamp/blob/main/day_3/3_5_Pandas_Recap_and_Query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Power of the query() Function

## Introduction:

In data analysis and manipulation, `pandas` and its functions are among the most important skills to master.


In this notebook, we will recap our basic pandas functions and dive into one of its powerful features: the `query()` function. The `query()` function allows us to perform **advanced filtering** operations on DataFrame objects, enabling us to extract specific subsets of data based on a variety of conditions.

## Recap of Pandas:

At its core, pandas revolves around two primary data structures: Series and DataFrame.

- **Series**: A Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a single variable in a statistical analysis.

- **DataFrame**: A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is similar to a spreadsheet or a SQL table, where each **column represents a different variable**, and each row represents an individual observation or record.
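As a minimal sketch of the two structures described above, the snippet below builds a small Series and a small DataFrame. The names `minutes` and `customers` and all values are made up for illustration and are not taken from the churn dataset.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical data)
minutes = pd.Series([120.5, 98.0, 143.2], name="day_minutes")

# A DataFrame: two-dimensional, each column can have its own data type
customers = pd.DataFrame({
    "state": ["NY", "CA", "TX"],
    "day_minutes": [120.5, 98.0, 143.2],
    "churn": [False, True, False],
})

print(minutes.name)      # the label attached to the Series
print(customers.shape)   # (rows, columns)

# Selecting a single column of a DataFrame gives back a Series
print(type(customers["state"]))
```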

Pandas provides an extensive range of functions and methods to manipulate, transform, and analyze data within Series and DataFrames.

## Introduction to the query() Function:

The `query()` function provides an expressive and concise way to filter DataFrames based on logical conditions, using a syntax similar to SQL queries. With `query()`, we can retrieve specific subsets of data that meet our specified criteria, eliminating the need for lengthy and complex boolean indexing expressions.
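To make the comparison with boolean indexing concrete, here is a small sketch on a made-up DataFrame (`toy`, with hypothetical columns `plan` and `calls`). Both approaches should select the same rows; `query()` just reads closer to plain language.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "plan": ["yes", "no", "yes", "no"],
    "calls": [1, 5, 4, 2],
})

# Boolean indexing: each condition needs parentheses and the DataFrame name
via_mask = toy[(toy["plan"] == "yes") & (toy["calls"] > 3)]

# query(): the same filter written as a single string expression
via_query = toy.query("plan == 'yes' and calls > 3")

print(via_query)
print(via_mask.equals(via_query))  # check that both filters agree
```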

The `query()` function accepts a **string expression as its argument**, which represents the logical condition to be evaluated. It allows us to **reference column names directly within the query expression**, simplifying the filtering process significantly.
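Two details of the string expression are worth a quick sketch: a local Python variable can be referenced with the `@` prefix, and a column name that contains a space can be wrapped in backticks. The DataFrame and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical data; "area code" deliberately contains a space
toy = pd.DataFrame({
    "area code": [415, 408, 510, 415],
    "eve_minutes": [180.0, 95.5, 210.3, 140.0],
})

threshold = 150  # an ordinary Python variable defined outside the query

# @threshold pulls the local variable into the expression
above = toy.query("eve_minutes > @threshold")

# Backticks let the expression refer to a column name with a space
area_415 = toy.query("`area code` == 415")

print(above)
print(area_415)
```

Cleaning column names up front (as done later in this notebook) avoids the need for backticks altogether.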




# 0. Import Libraries


---



```python
import warnings
import pandas as pd

# Suppress the FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)
```

# 1. Import Data


---

```python
churn = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/PythonBootcamp/main/day_3/data/churn.csv")
churn.head()
```

# Head()


---


The .head() method in pandas allows you to peek at the top rows of a DataFrame or a Series. It provides a quick preview of the data, displaying the first 5 rows by default. This helps you get a glimpse of the structure, contents, and column headers of the dataset.
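A short sketch of the default and of passing an explicit row count, using a made-up ten-row DataFrame (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical ten-row DataFrame
toy = pd.DataFrame({
    "customer_id": range(1, 11),
    "calls": [3, 0, 1, 4, 2, 5, 1, 0, 2, 3],
})

first_five = toy.head()    # default: first 5 rows
first_three = toy.head(3)  # pass n to control how many rows come back

print(first_five)
print(first_three)
```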

# Tail()
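The `.tail()` method mirrors `.head()`: it returns the last rows of a DataFrame or Series, 5 by default. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical ten-row DataFrame
toy = pd.DataFrame({"customer_id": range(1, 11)})

last_five = toy.tail()   # default: last 5 rows
last_two = toy.tail(2)   # pass n to control how many rows come back

print(last_five)
print(last_two)
```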

# Shape


---

The `.shape` attribute returns a tuple of `(rows, columns)` for a DataFrame (NumPy arrays expose the same attribute), allowing you to quickly understand the size and structure of your data.

```python
churn.shape
```


(3333, 21)

# Clean up Columns


---

## Clean up Column Names

It's just not fun dealing with ill-formed columns

- remove leading and trailing whitespace
- replace spaces with underscores (`_`)
- change case to lower case
- remove various special characters (`?`)


**EARMARK This CODE!!!**
```python
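# Applied to the churn DataFrame loaded above; swap in your own DataFrame as needed.
# regex=False treats each pattern as a literal string, so '(' and '?' are safe.
churn.columns = (
    churn.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('-', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('?', '', regex=False)
    .str.replace('\'', '', regex=False)  # notice the backslash \ this is an escape character
)
print(churn.columns)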
```



# Dtypes


---


The `.dtypes` attribute in pandas returns the data type of each column in a DataFrame, so you can quickly see which columns are numeric, boolean, or text and choose operations that suit each type.
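As an illustration, the sketch below inspects `.dtypes` on a made-up DataFrame and then uses `select_dtypes` to pick out the numeric columns. The DataFrame `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with text, integer, float, and boolean columns
toy = pd.DataFrame({
    "state": ["NY", "CA"],
    "calls": [3, 1],
    "day_charge": [20.5, 31.2],
    "churn": [True, False],
})

print(toy.dtypes)  # one data type per column

# Keep only the numeric columns (booleans are not counted as numbers)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```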

# Info


---


The .info() method in pandas provides a concise summary of the DataFrame, including information about the data types, non-null values, and memory usage. It is a useful tool for quickly assessing the overall structure and properties of the DataFrame.
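A small sketch on made-up data with one missing value shows where the non-null counts are useful. The DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical data; one state is missing
toy = pd.DataFrame({
    "state": ["NY", "CA", None],
    "calls": [3, 1, 4],
})

# Prints column names, non-null counts, dtypes, and memory usage
toy.info()

# The same non-null count for one column, computed directly
non_null_states = int(toy["state"].notna().sum())
print(non_null_states)
```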

# Describe

The `.describe()` method in pandas produces the following summary statistics:

- Count: Number of non-null values in each numerical column.
- Mean: Average value of each numerical column.
- Standard Deviation: Measure of the spread or dispersion of values in each numerical column.
- Minimum: Smallest value in each numerical column.
- 25th Percentile (1st Quartile): Value below which 25% of the data falls in each numerical column.
- Median (2nd Quartile): Middle value in each numerical column.
- 75th Percentile (3rd Quartile): Value below which 75% of the data falls in each numerical column.
- Maximum: Largest value in each numerical column.
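The statistics listed above can be seen in a minimal sketch on made-up data. Note that, by default, `.describe()` on a DataFrame with mixed types summarizes only the numeric columns. The DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
toy = pd.DataFrame({
    "calls": [1, 2, 3, 4, 5],
    "state": ["A", "B", "C", "D", "E"],
})

summary = toy.describe()  # only "calls" is summarized by default
print(summary)

# Individual statistics can be looked up by row label
median_calls = summary.loc["50%", "calls"]
print(median_calls)
```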




# Value Counts


---


The `.value_counts()` method in pandas counts the occurrences of each unique value in a Series or a single column of a DataFrame. The result is known as a **frequency distribution**.


For example, if you have a Series named `s`, calling `s.value_counts()` will return the count of each unique value in the Series.

Here's an example:

```python
import pandas as pd

# Creating a Series
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'B', 'C', 'C', 'C'])

# Counting the occurrences of unique values
value_counts = s.value_counts()

print(value_counts)
```

Use `value_counts()` to count the frequency of:

- state
- churn
- area_code

Repeat with `value_counts(normalize=True)` to get proportions instead of raw counts.
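To show what `normalize=True` changes, here is a sketch on a small example Series of letters rather than on the churn data. With `normalize=True` the result holds proportions that sum to 1 instead of counts.

```python
import pandas as pd

# A small example Series of letters
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'B', 'C', 'C', 'C'])

counts = s.value_counts()                     # raw counts per unique value
proportions = s.value_counts(normalize=True)  # share of each unique value

print(counts)
print(proportions)
```

Multiplying the proportions by 100 gives percentages.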

# Query


---

Answer the following questions:

- Exercise 1: Create a new DataFrame of churned customers, `res1`, by querying the churn dataset for rows where `churn` is equal to True. How many rows does it have? Use `print` and `.shape`.
- Exercise 2: Among the churned customers, who made more than 3 customer service calls? Store the result in `res2`. How many rows does it have? Use `print` and `.shape`.
- Exercise 3: Query for customers that have `intl_plan` equal to yes and `eve_charge` < 100, then calculate the percentage of churned vs. not churned customers. (Use `value_counts` with `normalize=True`.)
- Exercise 4: Which customers have more than 200 minutes of daytime usage and less than 150 minutes of evening usage? Calculate the percentage of churned vs. not churned customers. (Use `value_counts` with `normalize=True`.)
- Exercise 5: Which customers are located in either New York (NY) or California (CA) and have churned?
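The exercises share one general pattern: filter with `query()`, check the size with `.shape`, then summarize with `value_counts(normalize=True)`. The sketch below shows that pattern on made-up data, so it is not a solution to any exercise; `toy` and the columns `region`, `visits`, and `renewed` are hypothetical.

```python
import pandas as pd

# Hypothetical data, unrelated to the churn dataset
toy = pd.DataFrame({
    "region": ["east", "west", "north", "east", "west", "south"],
    "visits": [5, 2, 7, 1, 9, 4],
    "renewed": [True, False, True, False, True, True],
})

# "in" matches any value from a list; "and" combines conditions
subset = toy.query("region in ['east', 'west'] and visits > 1")

print(subset.shape)  # (rows, columns) of the filtered result

# Share of each outcome within the filtered rows
print(subset["renewed"].value_counts(normalize=True))
```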


```python
# Ex 1
```

```python
# Ex 2
```

```python
# Ex 3
```

```python
# Ex 4
```

```python
# Ex 5
```