# Section 01: Joining Tables
### `01-What columns would you join on?`
You'll be joining the `parts` and `part_categories` tables. Inspect them in the console first. To join them with the `inner_join()` verb, which column would you join on from each table?


==> `c("part_cat_id" = "id")`
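
One way to arrive at this answer is to compare the column names of the two tables and then name the key explicitly. The sketch below uses hypothetical miniature tables (`parts_toy` and `part_categories_toy` are made up for illustration and are not the real data); it assumes only that `dplyr` is installed.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (not the real data)
parts_toy <- data.frame(
  part_num    = c("p1", "p2", "p3"),
  name        = c("Brick 2 x 2", "Plate 2 x 4", "Sticker Sheet"),
  part_cat_id = c(11L, 14L, 58L)
)
part_categories_toy <- data.frame(
  id   = c(11L, 14L, 58L),
  name = c("Bricks", "Plates", "Stickers")
)

# The only shared column name is `name`, which describes different things
# in each table, so it is not a usable key
shared_names <- intersect(names(parts_toy), names(part_categories_toy))
print(shared_names)

# The key has a different name on each side:
# part_cat_id in the left table, id in the right table
joined_toy <- parts_toy %>%
  inner_join(part_categories_toy, by = c("part_cat_id" = "id"))
print(joined_toy)
```

In the named vector passed to `by`, the name on the left of `=` refers to the first table and the value on the right refers to the second table.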

In [1]:
library(dplyr)
parts <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\parts.csv", 
                    header=TRUE)
part_categories <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\part_categories.csv", 
                    header=TRUE)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### `02-Joining parts and part categories`

- Add the correct joining verb, the name of the second table, and the joining column for the second table.
- Then use the `suffix` argument so that the two clashing `name` columns become `name_part` and `name_category` instead of `name.x` and `name.y`.






In [2]:
# Use the suffix argument to replace .x and .y suffixes
parts %>% 
  inner_join(part_categories, by = c("part_cat_id" = "id"),
             suffix = c("_part", "_category"))

part_num,name_part,part_cat_id,part_material,name_category
<chr>,<chr>,<int>,<chr>,<chr>
003381,Sticker Sheet for Set 663-1,58,Plastic,Stickers
003383,"Sticker Sheet for Sets 618-1, 628-2",58,Plastic,Stickers
003402,"Sticker Sheet for Sets 310-3, 311-1, 312-3",58,Plastic,Stickers
003429,Sticker Sheet for Set 1550-1,58,Plastic,Stickers
003432,"Sticker Sheet for Sets 357-1, 355-1, 940-1",58,Plastic,Stickers
003434,"Sticker Sheet for Set 575-2, 653-1, 460-1",58,Plastic,Stickers
003435,Sticker Sheet for Set 687-1,58,Plastic,Stickers
003436,Sticker Sheet for Set 180-1,58,Plastic,Stickers
003437,Sticker Sheet for Set 181-1,58,Plastic,Stickers
003438,Sticker Sheet for Set 131-1,58,Plastic,Stickers
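
To see what `suffix` changes, it can help to compare the column names with and without it. This is a minimal sketch on hypothetical toy tables (the `_toy` objects are invented for illustration); the first element of `suffix` goes to the left table and the second to the right table.

```r
library(dplyr)

# Hypothetical toy tables; both have a `name` column,
# like parts and part_categories
parts_toy <- data.frame(
  name        = c("Brick 2 x 2", "Sticker Sheet"),
  part_cat_id = c(11L, 58L)
)
part_categories_toy <- data.frame(
  id   = c(11L, 58L),
  name = c("Bricks", "Stickers")
)

# Default: clashing non-key columns get ".x" (left) and ".y" (right)
default_names <- parts_toy %>%
  inner_join(part_categories_toy, by = c("part_cat_id" = "id")) %>%
  names()

# Custom: suffix = c(left, right) replaces ".x" and ".y"
custom_names <- parts_toy %>%
  inner_join(part_categories_toy, by = c("part_cat_id" = "id"),
             suffix = c("_part", "_category")) %>%
  names()

print(default_names)
print(custom_names)
```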


### `03-Joining parts and inventories`
Joining `parts` with `inventory_parts` increases the size of the table, because the two tables have a one-to-many relationship: a single part can appear in many inventories. Let's join them to observe this.


- Connect the `parts` and `inventory_parts` tables by their part numbers using an inner join.

In [6]:
inventory_parts <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\inventory_parts.csv", 
                    header=TRUE)

In [16]:
str(parts)
separator <- strrep("---", 30)
print(separator)
str(inventory_parts)

'data.frame':	47669 obs. of  4 variables:
 $ part_num     : chr  "003381" "003383" "003402" "003429" ...
 $ name         : chr  "Sticker Sheet for Set 663-1" "Sticker Sheet for Sets 618-1, 628-2" "Sticker Sheet for Sets 310-3, 311-1, 312-3" "Sticker Sheet for Set 1550-1" ...
 $ part_cat_id  : int  58 58 58 58 58 58 58 58 58 58 ...
 $ part_material: chr  "Plastic" "Plastic" "Plastic" "Plastic" ...
[1] "------------------------------------------------------------------------------------------"
'data.frame':	1038581 obs. of  5 variables:
 $ inventory_id: int  1 1 1 1 1 3 3 3 3 3 ...
 $ part_num    : chr  "48379c01" "48395" "stickerupn0077" "upn0342" ...
 $ color_id    : int  72 7 9999 0 25 47 29 2 15 15 ...
 $ quantity    : int  1 1 1 1 1 1 1 1 1 2 ...
 $ is_spare    : chr  "f" "f" "f" "f" ...


In [17]:
# Combine the parts and inventory_parts tables
parts %>%
  inner_join(inventory_parts, by = "part_num")

part_num,name,part_cat_id,part_material,inventory_id,color_id,quantity,is_spare
<chr>,<chr>,<int>,<chr>,<int>,<int>,<int>,<chr>
003381,Sticker Sheet for Set 663-1,58,Plastic,15865,9999,1,f
003383,"Sticker Sheet for Sets 618-1, 628-2",58,Plastic,8376,9999,1,f
003383,"Sticker Sheet for Sets 618-1, 628-2",58,Plastic,11738,9999,1,f
003402,"Sticker Sheet for Sets 310-3, 311-1, 312-3",58,Plastic,470,9999,1,f
003402,"Sticker Sheet for Sets 310-3, 311-1, 312-3",58,Plastic,885,9999,1,f
003402,"Sticker Sheet for Sets 310-3, 311-1, 312-3",58,Plastic,12659,9999,1,f
003429,Sticker Sheet for Set 1550-1,58,Plastic,3993,9999,1,f
003432,"Sticker Sheet for Sets 357-1, 355-1, 940-1",58,Plastic,8191,9999,1,f
003434,"Sticker Sheet for Set 575-2, 653-1, 460-1",58,Plastic,8756,9999,1,f
003434,"Sticker Sheet for Set 575-2, 653-1, 460-1",58,Plastic,12826,9999,1,f
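
The growth in row count can be checked directly with `nrow()`. The sketch below uses hypothetical toy tables rather than the real data (where, per the `str()` output above, `parts` has 47,669 rows and `inventory_parts` has 1,038,581); all `_toy` names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables: each part_num is unique in parts_toy
# but may repeat in inventory_parts_toy (one-to-many)
parts_toy <- data.frame(
  part_num = c("3003", "3020", "9999"),
  name     = c("Brick 2 x 2", "Plate 2 x 4", "Unused Part")
)
inventory_parts_toy <- data.frame(
  inventory_id = c(1L, 3L, 3L, 7L),
  part_num     = c("3003", "3003", "3020", "3003"),
  quantity     = c(2L, 1L, 4L, 6L)
)

joined_toy <- parts_toy %>%
  inner_join(inventory_parts_toy, by = "part_num")

# One output row per matching pair, so the result can have more rows
# than parts_toy; parts with no match ("9999") are dropped entirely
print(nrow(parts_toy))
print(nrow(joined_toy))

# Number of inventory rows matched by each part
matches_per_part <- joined_toy %>% count(part_num)
print(matches_per_part)
```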


### `04-Joining in either direction`
- Connect the `inventory_parts` and `parts` tables by their part numbers using an inner join.

In [19]:
# Combine the inventory_parts and parts tables (reverse direction)
inventory_parts %>%
  inner_join(parts, by = "part_num")

inventory_id,part_num,color_id,quantity,is_spare,name,part_cat_id,part_material
<int>,<chr>,<int>,<int>,<chr>,<chr>,<int>,<chr>
1,48379c01,72,1,f,"Large Figure Torso and Legs, Promo Figure Base with Feet",41,Plastic
1,48395,7,1,f,Sports Snowboard from McDonald's Promotional Set,27,Plastic
1,stickerupn0077,9999,1,f,Sticker Sheet for Set 7922-1,58,Plastic
1,upn0342,0,1,f,Sports Promo Paddle from McDonald's Sports Sets,27,Plastic
1,upn0350,25,1,f,Sports Promo Figure Head Torso Assembly McDonald's Set 6 (7922),13,Plastic
3,2343,47,1,f,Equipment Goblet / Glass,27,Plastic
3,3003,29,1,f,Brick 2 x 2,11,Plastic
3,30176,2,1,f,"Plant, Round 1 x 1 - 3 Bamboo Leaves",28,Plastic
3,3020,15,1,f,Plate 2 x 4,14,Plastic
3,3022,15,2,f,Plate 2 x 2,14,Plastic
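
Comparing the two outputs suggests that an inner join returns the same matched rows in either direction, and that only the column order (and the row order) differs. A minimal sketch of that check on hypothetical toy tables, with all `_toy` names invented for illustration:

```r
library(dplyr)

# Hypothetical toy tables (not the real data)
parts_toy <- data.frame(
  part_num = c("3003", "3020"),
  name     = c("Brick 2 x 2", "Plate 2 x 4")
)
inventory_parts_toy <- data.frame(
  inventory_id = c(1L, 3L, 3L, 7L),
  part_num     = c("3003", "3003", "3020", "3003"),
  quantity     = c(2L, 1L, 4L, 6L)
)

# Same join, written in both directions
parts_first     <- parts_toy %>%
  inner_join(inventory_parts_toy, by = "part_num")
inventory_first <- inventory_parts_toy %>%
  inner_join(parts_toy, by = "part_num")

# Columns of the left-hand table come first in each result
print(names(parts_first))
print(names(inventory_first))

# After putting columns and rows into a common order, the contents agree
normalise <- function(df) {
  df %>%
    select(all_of(sort(names(df)))) %>%
    arrange(inventory_id, part_num)
}
same_content <- isTRUE(all.equal(normalise(parts_first),
                                 normalise(inventory_first),
                                 check.attributes = FALSE))
print(same_content)
```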


### `05-Joining three tables`
- Combine the `inventories` table with the `sets` table.
- Next, join the `inventory_parts` table to the result of the previous join by inventory ID, matching `id` from the previous result to `inventory_id` in `inventory_parts`.

In [25]:
inventories <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\inventories.csv", 
                    header=TRUE)
sets <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\sets.csv", 
                    header=TRUE)
colors <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\colors.csv", 
                    header=TRUE)

In [40]:
sets %>%
  # Add inventories using an inner join 
  inner_join(inventories, by = "set_num") %>%
  # Add inventory_parts using an inner join 
  inner_join(inventory_parts, by = c("id" = "inventory_id"))

set_num,name,year,theme_id,num_parts,id,version,part_num,color_id,quantity,is_spare
<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<int>,<chr>
001-1,Gears,1965,1,43,24696,1,132a,7,4,f
001-1,Gears,1965,1,43,24696,1,3020,15,4,f
001-1,Gears,1965,1,43,24696,1,3062c,15,1,f
001-1,Gears,1965,1,43,24696,1,3404bc01,15,4,f
001-1,Gears,1965,1,43,24696,1,36,7,4,f
001-1,Gears,1965,1,43,24696,1,7039,4,6,f
001-1,Gears,1965,1,43,24696,1,7049b,15,4,f
001-1,Gears,1965,1,43,24696,1,715,4,4,f
001-1,Gears,1965,1,43,24696,1,741,15,4,f
001-1,Gears,1965,1,43,24696,1,742,14,4,f


### `06-What's the most common color?`

- Inner join the `colors` table, matching the `color_id` column from the previous join to the `id` column in `colors`; use the suffixes `"_set"` and `"_color"`.

- Count the `name_color` column and sort the results so the most prominent colors appear first.

In [42]:
# Count the number of colors and sort
sets %>%
  inner_join(inventories, by = "set_num") %>%
  inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
  inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
  count(name_color) %>%
  arrange(desc(n))

name_color,n
<chr>,<int>
Black,177681
White,111542
Light Bluish Gray,106366
Dark Bluish Gray,78541
Red,76804
Yellow,52875
Blue,42265
Reddish Brown,30955
Tan,28885
Light Gray,27654
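
The count-then-sort step can also be written in one call, since `count()` accepts a `sort = TRUE` argument. The sketch below checks that both forms agree on a hypothetical toy column (`colored_parts_toy` is invented for illustration and stands in for the joined table).

```r
library(dplyr)

# Hypothetical stand-in for the joined table; only name_color matters here
colored_parts_toy <- data.frame(
  name_color = c("Black", "White", "Black", "Red", "Black", "White")
)

# Two-step form used above: count, then sort descending by n
two_step <- colored_parts_toy %>%
  count(name_color) %>%
  arrange(desc(n))

# One-step form: sort = TRUE orders the counts from largest to smallest
one_step <- colored_parts_toy %>%
  count(name_color, sort = TRUE)

print(two_step)
print(one_step)
```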


### `The End`