# Pandas and Data Cleaning Basics

## 1. What does Pandas stand for?
The name **Pandas** is derived from **"panel data"**, an econometrics term for multidimensional structured datasets.  
It is also commonly read as a play on the phrase **"Python data analysis"**.

---

## 2. What are the 2 collections used in Pandas?
- `Series`: A one-dimensional labeled array.
- `DataFrame`: A two-dimensional labeled table (rows and columns).
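
As a rough illustration of how the two collections relate, the sketch below builds a small `Series` and a small `DataFrame`. The names `prices` and `fruit` and the values in them are made up for this example.

```python
import pandas as pd

# Hypothetical example data: a Series is one column of values plus an index of labels.
prices = pd.Series([1.20, 0.50, 2.75], index=["apple", "banana", "cherry"])

# A DataFrame is a table of columns that share one index.
fruit = pd.DataFrame(
    {"price": [1.20, 0.50, 2.75], "stock": [10, 25, 4]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column of a DataFrame gives back a Series.
price_column = fruit["price"]
print(type(price_column).__name__)
print(fruit.shape)
```

Selecting one column from the table returns a `Series`, which is one way to see that a `DataFrame` behaves like a collection of `Series` objects with a shared index.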

---

## 3. Name 4 things Pandas can do for us.
1. Load and read data from CSV, Excel, JSON, SQL, etc.
2. Clean and preprocess data (e.g., handle missing values, rename columns).
3. Perform data analysis with grouping, aggregation, and statistics.
4. Sort and manipulate data (merge, join, reshape, filter, etc.).
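
Item 3 (grouping and aggregation) can be sketched as follows. The `sales` table, its column names, and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical example data
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [3, 5, 7, 1],
})

# Group rows by region, sum the units in each group, then sort the totals
totals = sales.groupby("region")["units"].sum().sort_values(ascending=False)
print(totals)
```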

---

## 4. To permanently sort a DataFrame, which keyword should one use with the `df.sort()` method?
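Use the `inplace=True` keyword. Note that `df.sort()` has been removed from current versions of Pandas; the method to use is `df.sort_values()`, for example `df.sort_values(by='column_name', inplace=True)`.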
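
A minimal sketch of the difference between sorting with and without `inplace=True`. The DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["Cara", "Abe", "Bo"], "age": [31, 25, 40]})

# Without inplace, sort_values returns a new sorted DataFrame and leaves df alone
sorted_copy = df.sort_values(by="age")
order_before = df["name"].tolist()

# With inplace=True, df itself is reordered and the call returns None
result = df.sort_values(by="age", inplace=True)
order_after = df["name"].tolist()

print(order_before)
print(order_after)
```

Reassigning the result (`df = df.sort_values(by="age")`) has the same visible effect and is often preferred, because it can be chained with other method calls.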



## 5. What is a CSV?

CSV stands for Comma-Separated Values.

It is a plain-text file format in which each line is one data record and the fields within a record are separated by commas. It is commonly used for storing tabular data.
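
To make the format concrete, the sketch below parses a small piece of CSV text with Pandas. The text is held in a string and wrapped in `io.StringIO` so that no file is needed. The column names and values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content: a header line, then one record per line
csv_text = "name,qty\nflour,4\nmilk,1\n"

# read_csv accepts a file path or any file-like object
table = pd.read_csv(io.StringIO(csv_text))
print(table)
```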

---

## 6. When cleaning data, what values do we not like in our data?

We typically want to remove or handle the following:

1. Missing values (NaN, None)
2. Duplicates
3. Inconsistent formatting (e.g., lowercase vs uppercase, extra spaces)
4. Outliers
5. Invalid or corrupted data (e.g., strings in a numeric column)
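
A minimal sketch of how several of these problems might be handled with Pandas. The `raw` table is invented for the example, and the choice to drop incomplete rows rather than fill them is an assumption. Outlier handling is not covered here.

```python
import pandas as pd

# Hypothetical messy data: stray spaces, mixed case, a non-numeric entry, a missing value
raw = pd.DataFrame({
    "city": [" Delhi", "delhi", "Doha", None],
    "temp": ["31", "31", "n/a", "28"],
})

clean = raw.copy()
# Inconsistent formatting: trim whitespace and normalise capitalisation
clean["city"] = clean["city"].str.strip().str.title()
# Invalid data: strings that are not numbers become NaN
clean["temp"] = pd.to_numeric(clean["temp"], errors="coerce")
# Missing values and duplicates: drop incomplete rows, then repeated rows
clean = clean.dropna().drop_duplicates().reset_index(drop=True)
print(clean)
```

Only one row survives here, because the two Delhi rows become identical after normalisation and the other two rows each contain a missing value.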

---

In [None]:


#7. Import NumPy, use one of the NumPy methods 
# and create an array with a 
# shape of (2, 3, 2). 
# You can use the reshape method -- `.reshape()`

import numpy as np

array1 = np.arange(12)  # 12 values are needed because 2 * 3 * 2 = 12
print(array1)
reshapedarray = array1.reshape(2, 3, 2)
print(reshapedarray)

[ 0  1  2  3  4  5  6  7  8  9 10 11]
[[[ 0  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]]


In [None]:
#8. Use NumPy `.linspace()` to create an array with 
# 6 linearly spaced values between 0 and 20


arr = np.linspace(0, 20, 6)
print(arr)

[ 0.  4.  8. 12. 16. 20.]


In [None]:
#9. Make a Deep Copy of the above array
# ndarray.copy() returns a new array with its own copy of the data
copyabove_arr = arr.copy()
print(copyabove_arr)


[ 0.  4.  8. 12. 16. 20.]
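
One way to check that a copy is really independent is to change the original and see what follows it. The sketch below contrasts a slice, which is a view onto the same memory, with `.copy()`. The variable names are chosen only for this illustration.

```python
import numpy as np

original = np.linspace(0, 20, 6)

view = original[:]      # a slice is a view: it shares memory with the original
deep = original.copy()  # a copy owns its own data

original[0] = 99.0      # modify the original after both were created

print(view[0])   # follows the change
print(deep[0])   # keeps the old value
```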


In [10]:
#10. Concatenate these 3 arrays into a new array named 'newArray'...
#   ```python
#           ([[25, 16]])
#           ([[11, 2], [13, 4]])
#          ([[7, 81], [5, 6], [11, 12]])
#    ```

arr1 = np.array([[25, 16]])
arr2 = np.array([[11, 2], [13, 4]])
arr3 = np.array([[7, 81], [5, 6], [11, 12]])

# Join along axis 0 (rows); all three arrays have 2 columns
newArray = np.concatenate((arr1, arr2, arr3), axis=0)
print(newArray)

[[25 16]
 [11  2]
 [13  4]
 [ 7 81]
 [ 5  6]
 [11 12]]


In [11]:
#11. Sort 'newArray' in order into 'sortedArray'
# np.sort() sorts along the last axis by default, so each row is sorted on its own
sortedArray = np.sort(newArray)
print(sortedArray)

[[16 25]
 [ 2 11]
 [ 4 13]
 [ 7 81]
 [ 5  6]
 [11 12]]
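
The result above is sorted within each row because `np.sort` works along the last axis by default. The sketch below, using a small made-up array named `grid`, shows how the `axis` argument changes what "sorted" means for a 2-D array.

```python
import numpy as np

# Hypothetical example array
grid = np.array([[25, 16], [11, 2], [13, 4]])

by_row = np.sort(grid)            # default axis=-1: sort inside each row
by_col = np.sort(grid, axis=0)    # sort inside each column
flat = np.sort(grid, axis=None)   # flatten first, then sort every value

print(by_row)
print(by_col)
print(flat)
```

Which of the three is wanted depends on how "in order" is read in the exercise; `axis=None` is the choice when a single fully ordered sequence is needed.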


In [17]:
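#12. Unpack the array tuples from the above 'reshapedArray'
# into 4 well named variables. Print the 4 variables.
print(reshapedarray)
# reshapedarray has shape (2, 3, 2): two blocks, each holding three 2-element rows
block0_row0, block0_row1, block0_row2 = reshapedarray[0]
# Keep only the first row of the second block; the remaining rows are discarded
block1_row0, *_ = reshapedarray[1]
# Print the 4 variables
print(f"4 variables {block0_row0},{block0_row1},{block0_row2},{block1_row0}")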


[[[ 0  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]]
4 variables [0 1],[2 3],[4 5],[6 7]


In [8]:
#13. Combine and sort the following arrays into one called 'comboArray' ...

one = np.array([10, 11, 12, 13, 14, 15, 16, 17])
two = np.array([20, 21, 22, 23, 24, 25, 26, 27])
three = np.array([0, 1, 2, 3, 4, 5, 6, 7])
combinedArray = np.concatenate((one, two, three), axis=0)
comboArray = np.sort(combinedArray)
print(comboArray)

[ 0  1  2  3  4  5  6  7 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 27]


In [None]:
#14. Take 'comboArray' and perform the following slicing activities:
#    - print sec1 - the 2nd element
#    - print sec2 - all elements from the 3rd element to the last
#    - print sec3 - all elements from the 4th to the 14th elements
#    - print sec4 - the last 6 elements
#    - print sec5 - all elements from #0 up to and including #15, using the negative number method, i.e. taking a section from the end.
#    - print sec6 - from #20 every even element to the end
#    - print sec7 - from the last element moving forward, every 5th element.

sec1 = comboArray[1]       # 2nd element (index 1)
sec2 = comboArray[2:]      # 3rd element to the end
sec3 = comboArray[3:14]    # 4th to 14th elements (indices 3 to 13; the stop index is exclusive)
sec4 = comboArray[-6:]     # last 6 elements
sec5 = comboArray[:-8]     # indices 0 to 15: the array has 24 elements, so -8 is index 16
sec6 = comboArray[20::2]   # every 2nd element starting at index 20
sec7 = comboArray[-1::-5]  # every 5th element, stepping from the last element toward the front
print(sec1)
print(sec2)
print(sec3)
print(sec4)
print(sec5)
print(sec6)
print(sec7)

1
[ 2  3  4  5  6  7 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 27]
[ 3  4  5  6  7 10 11 12 13 14 15]
[22 23 24 25 26 27]
[ 0  1  2  3  4  5  6  7 10 11 12 13 14 15 16 17]
[24 26]
[27 22 15 10  3]
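
The slicing rules used in this cell follow the pattern `array[start:stop:step]`, where `stop` is exclusive and negative numbers count from the end. A small sketch on a made-up ten-element array named `values`:

```python
import numpy as np

values = np.arange(10) * 10   # hypothetical data: 0, 10, 20, ..., 90

second = values[1]            # index 1 is the 2nd element
tail = values[-3:]            # the last 3 elements
head_neg = values[:-4]        # everything except the last 4 elements
every_other = values[4::2]    # from index 4 to the end, every 2nd element
backwards = values[::-3]      # from the last element toward the front, every 3rd element

print(second, tail, head_neg, every_other, backwards)
```

A positive step that starts at the last element, such as `values[-1::3]`, returns only that one element because there is nothing after it; a negative step is needed to walk back through the array.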


In [10]:
#15. Using `Series`, create a `DataFrame` that looks like this:

   # | Ingredients | Quantity | Unit |
    # |----|----|----|
    # | Flour | 4 | cups |
    # | Milk | 1 | cup |
    # | Eggs | 2 | large |
    # | Spam | 1 | can |

    # Name: Dinner, dtype: object
import pandas as pd
Ingredients = pd.Series(["Flour", "Milk", "Eggs", "Spam"])
Quantity = pd.Series([4, 1, 2, 1])
Unit = pd.Series(["cups", "cup", "large", "can"])

# Each Series becomes one column; the dictionary keys become the column names
df = pd.DataFrame({
    "Ingredients": Ingredients,
    "Quantity": Quantity,
    "Unit": Unit,
})
print(df)

  Ingredients  Quantity   Unit
0       Flour         4   cups
1        Milk         1    cup
2        Eggs         2  large
3        Spam         1    can


In [11]:
# 16. Take this data and create a DataFrame named studentData
#     ```Python
#         {'Name': ['Jai', 'janusha', 'Gaurav', 'Anuj'],
#             'Height': [5.1, 6.2, 5.1, 5.2],
#             'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
#             'address': ['Delhi', 'Doha', 'Chennai', 'Dakhar'],
#             'Age': [21, 23, 24, 21],
#             'Pets': ['Dog', 'Bunny', 'Chinchilla', 'Parrot'],
#             'sport': ['Darts', 'Basketball', 'PaddleBoarding', 'Cricket']
#         }
#     ```

data = {
    'Name': ['Jai', 'janusha', 'Gaurav', 'Anuj'],
    'Height': [5.1, 6.2, 5.1, 5.2],
    'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
    'address': ['Delhi', 'Doha', 'Chennai', 'Dakhar'],
    'Age': [21, 23, 24, 21],
    'Pets': ['Dog', 'Bunny', 'Chinchilla', 'Parrot'],
    'sport': ['Darts', 'Basketball', 'PaddleBoarding', 'Cricket']
}

studentData = pd.DataFrame(data)
print(studentData)

      Name  Height Qualification  address  Age        Pets           sport
0      Jai     5.1           Msc    Delhi   21         Dog           Darts
1  janusha     6.2            MA     Doha   23       Bunny      Basketball
2   Gaurav     5.1           Msc  Chennai   24  Chinchilla  PaddleBoarding
3     Anuj     5.2           Msc   Dakhar   21      Parrot         Cricket


In [None]:

# 17. Add a new column to the DataFrame with the following desserts:
#         ["ice cream", "Cashew Fudge", "waffels", "Carrot Halwa"]
studentData["Desserts"] = ["ice cream", "Cashew Fudge", "waffels", "Carrot Halwa"]
print(studentData)

In [None]:
# 18. Sort the 'studentData' DataFrame in Ascending order -- Sorting by column 'Name' and then "address"
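# 'by' and 'ascending' each take a list; the column is named 'address' (lowercase) in studentData
studentDataSorted = studentData.sort_values(by=['Name', 'address'], ascending=[True, True])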
print(studentDataSorted)
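
A minimal sketch of multi-column sorting on a made-up table named `people`. It also shows one point that matters for data such as 'janusha': string sorting is case-sensitive, so uppercase names sort before lowercase ones unless a `key` function is supplied (the `key` argument assumes Pandas 1.1 or later).

```python
import pandas as pd

# Hypothetical example data
people = pd.DataFrame({
    "name": ["Bo", "abe", "Bo", "Cy"],
    "city": ["Oslo", "Rome", "Bern", "Lima"],
})

# Sort by name first; rows with the same name are then ordered by city
by_name_city = people.sort_values(by=["name", "city"], ascending=[True, True])

# Case-insensitive variant: the key function is applied to each sort column
folded = people.sort_values(by=["name", "city"], key=lambda col: col.str.lower())

print(by_name_city)
print(folded)
```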

In [None]:
# 19. Save this `DataFrame` here below to disc as a `.CSV` file with the name `cows_and_goats.csv`:

#     ```python
#         df = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
#     ```
df = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
df.to_csv('cows_and_goats.csv')
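
Because this DataFrame has a meaningful index ('Year 1', 'Year 2'), the index is written as the first, unnamed column of the file. The sketch below shows a round trip under that assumption. It keeps the CSV text in memory (calling `to_csv()` without a path returns a string) so that no file is created, and the names `herd` and `restored` are invented for the example.

```python
import io

import pandas as pd

herd = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])

# With no path argument, to_csv returns the CSV content as a string
csv_text = herd.to_csv()
print(csv_text)

# index_col=0 tells read_csv to use the first column as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Without `index_col=0`, the year labels would come back as an ordinary column named `Unnamed: 0`.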

In [24]:
# 20. (A) Using Pandas, make your own
# .CSV file with data on vegetables and save it.
# (B) Using Pandas, make a change to your CSV file,
# and save a copy with a different name.

# (A) Build the data and save it; index=False leaves the row numbers out of the file
veggies = pd.DataFrame({
    'Vegetable': ['Carrot', 'Broccoli', 'Spinach', 'Tomato'],
    'Color': ['Orange', 'Green', 'Green', 'Red'],
})
veggies.to_csv('vegetables.csv', index=False)
vegetables = pd.read_csv('vegetables.csv')
print(vegetables)

# (B) Add a column, then save the result under a different file name
vegetables['Is_leafy'] = [False, False, True, False]
vegetables.to_csv('vegetables_New.csv', index=False)
new_vegetable = pd.read_csv('vegetables_New.csv')
print(new_vegetable)

  Vegetable   Color
0    Carrot  Orange
1  Broccoli   Green
2   Spinach   Green
3    Tomato     Red
  Vegetable   Color  Is_leafy
0    Carrot  Orange     False
1  Broccoli   Green     False
2   Spinach   Green      True
3    Tomato     Red     False
