## Basic steps for importing and manipulating data with pandas

### Importing
* Import CSV

* Import TXT

* Import JSON

* Import Excel

### Basic Manipulation
* Shape

* Info

* Describe

* Head

* Tail

* .loc *(indexing)*

* .iloc *(indexing)*



In [22]:
import pandas as pd

## Dealing with backslashes in file paths (in a normal Python string, \ starts an escape sequence):


### On Windows you can either use double backslashes
```
df = pd.read_csv("F:\\Data Analyst\\data sets\\Alex Pandas\\pandas basics\\countries of the world.csv")
```

### OR you can use single forward slashes
```
df = pd.read_csv("F:/Data Analyst/data sets/Alex Pandas/pandas basics/countries of the world.csv")
```

### OR, usually the simplest way, put an r before the string so Python reads the path as raw text and does not treat \ as the start of an escape sequence
```
df = pd.read_csv(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.csv")
```
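
To see why the plain single-backslash form is risky, here is a minimal sketch comparing the three spellings. The path `C:\new\table.csv` is made up for illustration; it was chosen only because `\n` and `\t` are escape sequences.

```python
# Hypothetical path, chosen because \n and \t are escape sequences.
escaped_path = "C:\\new\\table.csv"   # double backslashes
raw_path = r"C:\new\table.csv"        # raw string
broken_path = "C:\new\table.csv"      # plain string: \n and \t are interpreted

print(escaped_path == raw_path)       # True: both spell the same path
print("\n" in broken_path)            # True: the path now contains a newline
print("\t" in broken_path)            # True: and a tab character
```

The raw string and the double-backslash string are identical, while the plain string silently turns into a different path.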

----------------------------

## REMOVING HEADERS

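### If you do not want the first row used as column names, use header=None. Pandas then numbers the columns 0, 1, 2, ... and keeps the original header row as the first row of data
```
df = pd.read_csv(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.csv", header=None)
```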

### If you want to specify the column names yourself, use names=[]. If the file already has a header row, also pass header=0 so that row is replaced instead of being read as data
```
df = pd.read_csv(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.csv", names=['Country', 'Region'])
```
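
The effect of `header` and `names` is easier to see on a tiny in-memory file. This is only a sketch: the CSV text and the replacement column names (`country_name`, `region_name`) are made up for illustration.

```python
import io

import pandas as pd

# Small made-up CSV with a header row, held in memory instead of on disk.
csv_text = "Country,Region\nAlbania,EASTERN EUROPE\nZambia,SUB-SAHARAN AFRICA\n"

# Default: the first line becomes the column names.
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data, columns are numbered.
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)

# header=0 plus names: the existing header row is replaced by our names.
renamed_df = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["country_name", "region_name"]
)

print(default_df.shape)      # (2, 2)
print(no_header_df.shape)    # (3, 2)
print(list(renamed_df.columns))
```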

In [23]:
# Read the CSV file into a DataFrame

df = pd.read_csv(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.csv")
df

Unnamed: 0,Country,Region
0,Afghanistan,ASIA (EX. NEAR EAST)
1,Albania,EASTERN EUROPE
2,Algeria,NORTHERN AFRICA
3,American Samoa,OCEANIA
4,Andorra,WESTERN EUROPE
...,...,...
222,West Bank,NEAR EAST
223,Western Sahara,NORTHERN AFRICA
224,Yemen,NEAR EAST
225,Zambia,SUB-SAHARAN AFRICA


### You can import txt files with read_csv, but a tab-delimited file will not be split into columns correctly, because read_csv expects commas by default.


#### You can fix this by specifying the separator
```
txt_df = pd.read_csv(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.txt", sep='\t')
```

#### A simpler way to import tab-delimited txt files is read_table, which uses a tab as its default separator

In [24]:
# Import the txt file using read_table

txt_df = pd.read_table(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\countries of the world.txt") 
txt_df

Unnamed: 0,Country,Region
0,Afghanistan,ASIA (EX. NEAR EAST)
1,Albania,EASTERN EUROPE
2,Algeria,NORTHERN AFRICA
3,American Samoa,OCEANIA
4,Andorra,WESTERN EUROPE
...,...,...
222,West Bank,NEAR EAST
223,Western Sahara,NORTHERN AFRICA
224,Yemen,NEAR EAST
225,Zambia,SUB-SAHARAN AFRICA
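
The difference between the three approaches can be checked on a small tab-delimited sample. This is a sketch using made-up in-memory text rather than the file above.

```python
import io

import pandas as pd

# Made-up tab-delimited text standing in for a .txt file.
tsv_text = "Country\tRegion\nAlbania\tEASTERN EUROPE\nZambia\tSUB-SAHARAN AFRICA\n"

# read_csv with its default comma separator: everything lands in one column.
comma_df = pd.read_csv(io.StringIO(tsv_text))

# read_csv with an explicit tab separator: two proper columns.
tab_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# read_table uses a tab separator by default.
table_df = pd.read_table(io.StringIO(tsv_text))

print(comma_df.shape)            # (2, 1)
print(tab_df.shape)              # (2, 2)
print(tab_df.equals(table_df))   # True
```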


In [29]:
# JSON is structured data. You can load it using read_json

json_df = pd.read_json(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\json_sample.json") 
json_df

Unnamed: 0,12 Strong,A Fantastic Woman (Una Mujer Fantástica),All The Money In The World,Bilal: A New Breed Of Hero,Call Me By Your Name,Darkest Hour,Den Of Thieves,Ferdinand,Fifty Shades Freed,Film Stars Don'T Die In Liverpool,Forever My Girl,Golden Exits,Hostiles,"I, Tonya",Insidious: The Last Key,Jumanji: Welcome To The Jungle,Mary And The Witch'S Flower,Maze Runner: The Death Cure,Molly'S Game,Paddington 2,Padmaavat,Permission,Peter Rabbit,Phantom Thread,Pitch Perfect 3,Proud Mary,Sanpo Suru Shinryakusha,Star Wars: The Last Jedi,The 15:17 To Paris,The Commuter,The Disaster Artist,The Greatest Showman,The Insult (L'Insulte),The Post,The Shape Of Water,"Three Billboards Outside Ebbing, Missouri",Till The End Of The World,Winchester
0,"{'Genre': 'Action', 'Gross': '$453,173', 'IMDB...","{'popcornscore': 83, 'rating': 'R', 'tomatosco...","{'popcornscore': 71, 'rating': 'R', 'tomatosco...","{'popcornscore': 91, 'rating': 'PG13', 'tomato...","{'popcornscore': 87, 'rating': 'R', 'tomatosco...","{'popcornscore': 84, 'rating': 'PG13', 'tomato...","{'Genre': 'Action', 'Gross': '$491,898', 'IMDB...","{'popcornscore': 49, 'rating': 'PG', 'tomatosc...","{'Genre': 'Drama', 'Gross': 'unknown', 'IMDB M...","{'popcornscore': 69, 'rating': 'R', 'tomatosco...","{'popcornscore': 91, 'rating': 'PG', 'tomatosc...","{'Genre': 'Drama', 'Gross': 'unknown', 'IMDB M...","{'Genre': 'Adventure', 'Gross': '$548,886', 'I...","{'popcornscore': 89, 'rating': 'R', 'tomatosco...","{'popcornscore': 51, 'rating': 'PG13', 'tomato...","{'Genre': 'Action', 'Gross': '$760,867', 'IMDB...","{'popcornscore': 78, 'rating': 'PG', 'tomatosc...","{'Genre': 'Action', 'Gross': '$720,463', 'IMDB...","{'popcornscore': 85, 'rating': 'R', 'tomatosco...","{'Genre': 'Animation', 'Gross': '$184,414', 'I...","{'popcornscore': 62, 'rating': 'NR', 'tomatosc...","{'Genre': 'Comedy', 'Gross': 'unknown', 'IMDB ...","{'Genre': 'Animation', 'Gross': 'unknown', 'IM...","{'popcornscore': 68, 'rating': 'R', 'tomatosco...","{'popcornscore': 52, 'rating': 'PG13', 'tomato...","{'popcornscore': 56, 'rating': 'R', 'tomatosco...","{'Genre': 'Drama', 'Gross': 'unknown', 'IMDB M...","{'popcornscore': 48, 'rating': 'PG13', 'tomato...","{'Genre': 'Drama', 'Gross': 'unknown', 'IMDB M...","{'popcornscore': 48, 'rating': 'PG13', 'tomato...","{'popcornscore': 89, 'rating': 'R', 'tomatosco...","{'Genre': 'Biography', 'Gross': '$627,248', 'I...","{'popcornscore': 86, 'rating': 'R', 'tomatosco...","{'Genre': 'Biography', 'Gross': '$463,228', 'I...","{'Genre': 'Adventure', 'Gross': '$448,287', 'I...","{'popcornscore': 87, 'rating': 'R', 'tomatosco...","{'popcornscore': -1, 'rating': 'NR', 'tomatosc...","{'Genre': 'Biography', 'Gross': '$696,786', 'I..."
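
The output above has a single row whose cells are whole dictionaries, which is hard to work with. One possible way to get one row per movie is sketched below. It assumes the file holds a list containing one object keyed by movie title (inferred from the single row `0` above, not confirmed), and the titles and fields used here are made up.

```python
import io
import json

import pandas as pd

# Made-up JSON with the assumed shape: a list holding one object,
# keyed by movie title, with a nested object per movie.
json_text = json.dumps([
    {
        "Movie A": {"Genre": "Action", "rating": "R"},
        "Movie B": {"Genre": "Drama", "rating": "PG"},
    }
])

# read_json keeps each nested object as a dict inside a single cell.
wide_df = pd.read_json(io.StringIO(json_text))

# Loading with the json module first lets us build one row per movie.
records = json.loads(json_text)[0]
tidy_df = pd.DataFrame.from_dict(records, orient="index")

print(wide_df.shape)   # one row, one column per movie
print(tidy_df)         # one row per movie, one column per field
```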


### Excel files can easily be loaded in using read_excel

#### By default read_excel reads the first sheet. You can load a different sheet using sheet_name=
```
xl_df = pd.read_excel(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\world_population_excel_workbook.xlsx", sheet_name='Sheet2')
```

In [26]:
# Let's bring in the desired sheet from our Excel file. Sheet1 is actually the name of the second sheet, but it is the one we want

xl_df = pd.read_excel(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\world_population_excel_workbook.xlsx", sheet_name = 'Sheet1') 
xl_df

Unnamed: 0,Rank,CCA3,Country,Capital
0,36,AFG,Afghanistan,Kabul
1,138,ALB,Albania,Tirana
2,34,DZA,Algeria,Algiers
3,213,ASM,American Samoa,Pago Pago
4,203,AND,Andorra,Andorra la Vella
...,...,...,...,...
229,226,WLF,Wallis and Futuna,Mata-Utu
230,172,ESH,Western Sahara,El Aaiún
231,46,YEM,Yemen,Sanaa
232,63,ZMB,Zambia,Lusaka


### By default pandas shows only the first and last few rows of a large DataFrame.

#### You can change this by using
```
pd.set_option('display.max_rows', X)
```

##### _X being the number you want to set the option to_


#### You can also change the maximum number of columns using
```
pd.set_option('display.max_columns', X)
```
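
As a minimal sketch of how these display options behave, the snippet below changes a setting, reads it back, and restores the default. The values 250 and 50 are arbitrary examples.

```python
import pandas as pd

# Remember the current setting so we can compare later.
default_rows = pd.get_option("display.max_rows")

# Raise the row limit, then read the new value back.
pd.set_option("display.max_rows", 250)
raised_rows = pd.get_option("display.max_rows")

# reset_option puts the setting back to the pandas default.
pd.reset_option("display.max_rows")

# option_context changes a setting only inside the with-block.
with pd.option_context("display.max_columns", 50):
    temporary_columns = pd.get_option("display.max_columns")

print(raised_rows)         # 250
print(temporary_columns)   # 50
```

`option_context` can be handy in a notebook when only one cell needs the wider display.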

In [79]:
# First we would run set_option to change the default display setting in pandas

## To see ALL of our rows we would use pd.set_option('display.max_rows', 250)
### For the sake of keeping this clean I will leave it at the default

xl_df = pd.read_excel(r"F:\Data Analyst\data sets\Alex Pandas\pandas basics\world_population_excel_workbook.xlsx", sheet_name = 'Sheet1') 
xl_df

Unnamed: 0,Rank,CCA3,Country,Capital
0,36,AFG,Afghanistan,Kabul
1,138,ALB,Albania,Tirana
2,34,DZA,Algeria,Algiers
3,213,ASM,American Samoa,Pago Pago
4,203,AND,Andorra,Andorra la Vella
...,...,...,...,...
229,226,WLF,Wallis and Futuna,Mata-Utu
230,172,ESH,Western Sahara,El Aaiún
231,46,YEM,Yemen,Sanaa
232,63,ZMB,Zambia,Lusaka


## Functions to view info about your Data

* **Shape** gives you the row and column counts as a tuple: (rows, columns)

* **Info** gives you data types and non-null counts. It also tells you memory usage, which can be very important if you are working with large datasets.

* **Describe** gives you basic summary statistics for the numeric columns (count, mean, std, min, max, quartiles)

* **Head** gives you the first 5 rows, unless you specify a different number

* **Tail** gives you the last 5 rows, unless you specify a different number
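
The functions listed above can be tried on a small hand-built DataFrame. This is a sketch only; the four rows are copied from the first rows of the Excel output so it runs without the workbook.

```python
import pandas as pd

# Four rows taken from the start of the Excel data shown earlier.
sample_df = pd.DataFrame(
    {
        "Rank": [36, 138, 34, 213],
        "Country": ["Afghanistan", "Albania", "Algeria", "American Samoa"],
    }
)

n_rows, n_cols = sample_df.shape   # shape is an attribute, not a method
first_two = sample_df.head(2)      # first 2 rows instead of the default 5
last_two = sample_df.tail(2)       # last 2 rows instead of the default 5
summary = sample_df.describe()     # statistics for the numeric column only

sample_df.info()                   # prints dtypes, non-null counts, memory usage
print(n_rows, n_cols)              # 4 2
print(summary.loc["mean", "Rank"]) # 105.25
```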

In [80]:
xl_df.shape

(234, 4)

In [81]:
xl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Rank     234 non-null    int64 
 1   CCA3     234 non-null    object
 2   Country  234 non-null    object
 3   Capital  234 non-null    object
dtypes: int64(1), object(3)
memory usage: 7.4+ KB


In [82]:
xl_df.describe()

Unnamed: 0,Rank
count,234.0
mean,117.5
std,67.694165
min,1.0
25%,59.25
50%,117.5
75%,175.75
max,234.0


In [83]:
xl_df.head(10)

Unnamed: 0,Rank,CCA3,Country,Capital
0,36,AFG,Afghanistan,Kabul
1,138,ALB,Albania,Tirana
2,34,DZA,Algeria,Algiers
3,213,ASM,American Samoa,Pago Pago
4,203,AND,Andorra,Andorra la Vella
5,42,AGO,Angola,Luanda
6,224,AIA,Anguilla,The Valley
7,201,ATG,Antigua and Barbuda,Saint John’s
8,33,ARG,Argentina,Buenos Aires
9,140,ARM,Armenia,Yerevan


In [84]:
xl_df.tail(10)

Unnamed: 0,Rank,CCA3,Country,Capital
224,43,UZB,Uzbekistan,Tashkent
225,181,VUT,Vanuatu,Port-Vila
226,234,VAT,Vatican City,Vatican City
227,51,VEN,Venezuela,Caracas
228,16,VNM,Vietnam,Hanoi
229,226,WLF,Wallis and Futuna,Mata-Utu
230,172,ESH,Western Sahara,El Aaiún
231,46,YEM,Yemen,Sanaa
232,63,ZMB,Zambia,Lusaka
233,74,ZWE,Zimbabwe,Harare


## .loc and .iloc

* **.loc** selects rows by index label and returns their details. It only works with whatever is currently used as the index

* **.iloc** selects rows by integer position (starting at 0) and returns their details. It works the same way no matter what is currently used as the index

In [85]:
# .loc works on whatever is currently set as the index, integer or not.

xl_df.loc[224]

Rank               43
CCA3              UZB
Country    Uzbekistan
Capital      Tashkent
Name: 224, dtype: object

In [86]:
# Let's change what we are using to index the data so we can show the difference between .loc and .iloc
# Currently the df is indexed by integers

# Let's change the index to use Country instead

xl_df = xl_df.set_index('Country')

xl_df

Country,Rank,CCA3,Capital
Afghanistan,36,AFG,Kabul
Albania,138,ALB,Tirana
Algeria,34,DZA,Algiers
American Samoa,213,ASM,Pago Pago
Andorra,203,AND,Andorra la Vella
...,...,...,...
Wallis and Futuna,226,WLF,Mata-Utu
Western Sahara,172,ESH,El Aaiún
Yemen,46,YEM,Sanaa
Zambia,63,ZMB,Lusaka


In [87]:
# Now we can use .loc on the new index item.

xl_df.loc['Uzbekistan']

Rank             43
CCA3            UZB
Capital    Tashkent
Name: Uzbekistan, dtype: object

In [88]:
# Even though the index is currently Country, .iloc can still be used to select a specific row by integer position.

## If you try to use .loc with an integer that is not a label in the current index, you will get a KeyError

xl_df.iloc[224]

Rank             43
CCA3            UZB
Capital    Tashkent
Name: Uzbekistan, dtype: object
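
The contrast between the two indexers, including the error mentioned above, can be reproduced on a small sample. This is a sketch with three rows copied from the tail of the data, not the full workbook.

```python
import pandas as pd

# Three rows copied from the tail of the data shown earlier.
sample_df = pd.DataFrame(
    {
        "Country": ["Uzbekistan", "Vanuatu", "Vietnam"],
        "Rank": [43, 181, 16],
        "Capital": ["Tashkent", "Port-Vila", "Hanoi"],
    }
).set_index("Country")

by_label = sample_df.loc["Vanuatu"]   # label lookup on the Country index
by_position = sample_df.iloc[1]       # second row, whatever the index is

# An integer is not a label in this index, so .loc raises KeyError.
try:
    sample_df.loc[1]
    loc_raised_key_error = False
except KeyError:
    loc_raised_key_error = True

print(by_label.equals(by_position))   # True: both return the Vanuatu row
print(loc_raised_key_error)           # True
```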

## There is so much more to importing and manipulating data. 

- ### These basics should give you a good starting point as you further explore Python and data analytics.