[
{
"course": "data-engineering-zoomcamp",
"documents": [
{
"text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
"section": "General course-related questions",
"question": "Course - When will the course start?"
},
{
"text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",
"section": "General course-related questions",
"question": "Course - What are the prerequisites for this course?"
},
{
"text": "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
"section": "General course-related questions",
"question": "Course - Can I still join the course after the start date?"
},
{
"text": "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
"section": "General course-related questions",
"question": "Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?"
},
{
"text": "You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.",
"section": "General course-related questions",
"question": "Course - What can I do before the course starts?"
},
{
"text": "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp \u201clive\u201d cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you\u2019re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any \u201clive\u201d cohort.",
"section": "General course-related questions",
"question": "Course - how many Zoomcamps in a year?"
},
{
"text": "Yes. For the 2024 edition we are using Mage AI instead of Prefect and re-recorded the terraform videos, For 2023, we used Prefect instead of Airflow..",
"section": "General course-related questions",
"question": "Course - Is the current cohort going to be different from the previous cohort?"
},
{
"text": "Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.",
"section": "General course-related questions",
"question": "Course - Can I follow the course after it finishes?"
},
{
"text": "Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don\u2019t rely on its answers 100%, it is pretty good though.",
"section": "General course-related questions",
"question": "Course - Can I get support if I take the course in the self-paced mode?"
},
{
"text": "All the main videos are stored in the Main \u201cDATA ENGINEERING\u201d playlist (no year specified). The Github repository has also been updated to show each video with a thumbnail, that would bring you directly to the same playlist below.\nBelow is the MAIN PLAYLIST\u2019. And then you refer to the year specific playlist for additional videos for that year like for office hours videos etc. Also find this playlist pinned to the slack channel.\nh\nttps://youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&si=NspQhtZhZQs1B9F-",
"section": "General course-related questions",
"question": "Course - Which playlist on YouTube should I refer to?"
},
{
"text": "It depends on your background and previous experience with modules. It is expected to require about 5 - 15 hours per week. [source1] [source2]\nYou can also calculate it yourself using this data and then update this answer.",
"section": "General course-related questions",
"question": "Course - \u200b\u200bHow many hours per week am I expected to spend on this course?"
},
{
"text": "No, you can only get a certificate if you finish the course with a \u201clive\u201d cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.",
"section": "General course-related questions",
"question": "Certificate - Can I follow the course in a self-paced mode and get a certificate?"
},
{
"text": "The zoom link is only published to instructors/presenters/TAs.\nStudents participate via Youtube Live and submit questions to Slido (link would be pinned in the chat when Alexey goes Live). The video URL should be posted in the announcements channel on Telegram & Slack before it begins. Also, you will see it live on the DataTalksClub YouTube Channel.\nDon\u2019t post your questions in chat as it would be off-screen before the instructors/moderators have a chance to answer it if the room is very active.",
"section": "General course-related questions",
"question": "Office Hours - What is the video/zoom link to the stream for the \u201cOffice Hour\u201d or workshop sessions?"
},
{
"text": "Yes! Every \u201cOffice Hours\u201d will be recorded and available a few minutes after the live session is over; so you can view (or rewatch) whenever you want.",
"section": "General course-related questions",
"question": "Office Hours - I can\u2019t attend the \u201cOffice hours\u201d / workshop, will it be recorded?"
},
{
"text": "You can find the latest and up-to-date deadlines here: https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml\nAlso, take note of Announcements from @Au-Tomator for any extensions or other news. Or, the form may also show the updated deadline, if Instructor(s) has updated it.",
"section": "General course-related questions",
"question": "Homework - What are homework and project deadlines?"
},
{
"text": "No, late submissions are not allowed. But if the form is still not closed and it\u2019s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\nOlder news:[source1] [source2]",
"section": "General course-related questions",
"question": "Homework - Are late submissions of homework allowed?"
},
{
"text": "Answer: In short, it\u2019s your repository on github, gitlab, bitbucket, etc\nIn long, your repository or any other location you have your code where a reasonable person would look at it and think yes, you went through the week and exercises.",
"section": "General course-related questions",
"question": "Homework - What is the homework URL in the homework link?"
},
{
"text": "After you submit your homework it will be graded based on the amount of questions in a particular homework. You can see how many points you have right on the page of the homework up top. Additionally in the leaderboard you will find the sum of all points you\u2019ve earned - points for Homeworks, FAQs and Learning in Public. If homework is clear, others work as follows: if you submit something to FAQ, you get one point, for each learning in a public link you get one point.\n(https://datatalks-club.slack.com/archives/C01FABYF2RG/p1706846846359379?thread_ts=1706825019.546229&cid=C01FABYF2RG)",
"section": "General course-related questions",
"question": "Homework and Leaderboard - what is the system for points in the course management platform?"
},
{
"text": "When you set up your account you are automatically assigned a random name such as \u201cLucid Elbakyan\u201d for example. If you want to see what your Display name is.\nGo to the Homework submission link \u2192 https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2 - Log in > Click on \u2018Data Engineering Zoom Camp 2024\u2019 > click on \u2018Edit Course Profile\u2019 - your display name is here, you can also change it should you wish:",
"section": "General course-related questions",
"question": "Leaderboard - I am not on the leaderboard / how do I know which one I am on the leaderboard?"
},
{
"text": "Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]\nBut Python 3.10 and 3.11 should work fine.",
"section": "General course-related questions",
"question": "Environment - Is Python 3.9 still the recommended version to use in 2024?"
},
{
"text": "You can set it up on your laptop or PC if you prefer to work locally from your laptop or PC.\nYou might face some challenges, especially for Windows users. If you face cnd2\nIf you prefer to work on the local machine, you may start with the week 1 Introduction to Docker and follow through.\nHowever, if you prefer to set up a virtual machine, you may start with these first:\nUsing GitHub Codespaces\nSetting up the environment on a cloudV Mcodespace\nI decided to work on a virtual machine because I have different laptops & PCs for my home & office, so I can work on this boot camp virtually anywhere.",
"section": "General course-related questions",
"question": "Environment - Should I use my local machine, GCP, or GitHub Codespaces for my environment?"
},
{
"text": "GitHub Codespaces offers you computing Linux resources with many pre-installed tools (Docker, Docker Compose, Python).\nYou can also open any GitHub repository in a GitHub Codespace.",
"section": "General course-related questions",
"question": "Environment - Is GitHub codespaces an alternative to using cli/git bash to ingest the data and create a docker file?"
},
{
"text": "It's up to you which platform and environment you use for the course.\nGithub codespaces or GCP VM are just possible options, but you can do the entire course from your laptop.",
"section": "General course-related questions",
"question": "Environment - Do we really have to use GitHub codespaces? I already have PostgreSQL & Docker installed."
},
{
"text": "Choose the approach that aligns the most with your idea for the end project\nOne of those should suffice. However, BigQuery, which is part of GCP, will be used, so learning that is probably a better option. Or you can set up a local environment for most of this course.",
"section": "General course-related questions",
"question": "Environment - Do I need both GitHub Codespaces and GCP?"
},
{
"text": "1. To open Run command window, you can either:\n(1-1) Use the shortcut keys: 'Windows + R', or\n(1-2) Right Click \"Start\", and click \"Run\" to open.\n2. Registry Values Located in Registry Editor, to open it: Type 'regedit' in the Run command window, and then press Enter.' 3. Now you can change the registry values \"Autorun\" in \"HKEY_CURRENT_USER\\Software\\Microsoft\\Command Processor\" from \"if exists\" to a blank.\nAlternatively, You can simplify the solution by deleting the fingerprint saved within the known_hosts file. In Windows, this file is placed at C:\\Users\\<your_user_name>\\.ssh\\known_host",
"section": "General course-related questions",
"question": "This happens when attempting to connect to a GCP VM using VSCode on a Windows machine. Changing registry value in registry editor"
},
{
"text": "For uniformity at least, but you\u2019re not restricted to GCP, you can use other cloud platforms like AWS if you\u2019re comfortable with other cloud platforms, since you get every service that\u2019s been provided by GCP in Azure and AWS or others..\nBecause everyone has a google account, GCP has a free trial period and gives $300 in credits to new users. Also, we are working with BigQuery, which is a part of GCP.\nNote that to sign up for a free GCP account, you must have a valid credit card.",
"section": "General course-related questions",
"question": "Environment - Why are we using GCP and not other cloud providers?"
},
{
"text": "No, if you use GCP and take advantage of their free trial.",
"section": "General course-related questions",
"question": "Should I pay for cloud services?"
},
{
"text": "You can do most of the course without a cloud. Almost everything we use (excluding BigQuery) can be run locally. We won\u2019t be able to provide guidelines for some things, but most of the materials are runnable without GCP.\nFor everything in the course, there\u2019s a local alternative. You could even do the whole course locally.",
"section": "General course-related questions",
"question": "Environment - The GCP and other cloud providers are unavailable in some countries. Is it possible to provide a guide to installing a home lab?"
},
{
"text": "Yes, you can. Just remember to adapt all the information on the videos to AWS. Besides, the final capstone will be evaluated based on the task: Create a data pipeline! Develop a visualisation!\nThe problem would be when you need help. You\u2019d need to rely on fellow coursemates who also use AWS (or have experience using it before), which might be in smaller numbers than those learning the course with GCP.\nAlso see Is it possible to use x tool instead of the one tool you use?",
"section": "General course-related questions",
"question": "Environment - I want to use AWS. May I do that?"
},
{
"text": "We will probably have some calls during the Capstone period to clear some questions but it will be announced in advance if that happens.",
"section": "General course-related questions",
"question": "Besides the \u201cOffice Hour\u201d which are the live zoom calls?"
},
{
"text": "We will use the same data, as the project will essentially remain the same as last year\u2019s. The data is available here",
"section": "General course-related questions",
"question": "Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data?"
},
{
"text": "No, but we moved the 2022 stuff here",
"section": "General course-related questions",
"question": "Is the 2022 repo deleted?"
},
{
"text": "Yes, you can use any tool you want for your project.",
"section": "General course-related questions",
"question": "Can I use Airflow instead for my final project?"
},
{
"text": "Yes, this applies if you want to use Airflow or Prefect instead of Mage, AWS or Snowflake instead of GCP products or Tableau instead of Metabase or Google data studio.\nThe course covers 2 alternative data stacks, one using GCP and one using local installation of everything. You can use one of them or use your tool of choice.\nShould you consider it instead of the one tool you use? That we can\u2019t support you if you choose to use a different stack, also you would need to explain the different choices of tool for the peer review of your capstone project.",
"section": "General course-related questions",
"question": "Is it possible to use tool \u201cX\u201d instead of the one tool you use in the course?"
},
{
"text": "Star the repo! Share it with friends if you find it useful \u2763\ufe0f\nCreate a PR if you see you can improve the text or the structure of the repository.",
"section": "General course-related questions",
"question": "How can we contribute to the course?"
},
{
"text": "Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully",
"section": "General course-related questions",
"question": "Environment - Is the course [Windows/mac/Linux/...] friendly?"
},
{
"text": "Have no idea how past cohorts got past this as I haven't read old slack messages, and no FAQ entries that I can find.\nLater modules (module-05 & RisingWave workshop) use shell scripts in *.sh files and most Windows users not using WSL would hit a wall and cannot continue, even in git bash or MINGW64. This is why WSL environment setup is recommended from the start.",
"section": "General course-related questions",
"question": "Environment - Roadblock for Windows users in modules with *.sh (shell scripts)."
},
{
"text": "Yes to both! check out this document: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md",
"section": "General course-related questions",
"question": "Any books or additional resources you recommend?"
},
{
"text": "You will have two attempts for a project. If the first project deadline is over and you\u2019re late or you submit the project and fail the first attempt, you have another chance to submit the project with the second attempt.",
"section": "General course-related questions",
"question": "Project - What is Project Attemp #1 and Project Attempt #2 exactly?"
},
{
"text": "The first step is to try to solve the issue on your own. Get used to solving problems and reading documentation. This will be a real life skill you need when employed. [ctrl+f] is your friend, use it! It is a universal shortcut and works in all apps/browsers.\nWhat does the error say? There will often be a description of the error or instructions on what is needed or even how to fix it. I have even seen a link to the solution. Does it reference a specific line of your code?\nRestart app or server/pc.\nGoogle it, use ChatGPT, Bing AI etc.\nIt is going to be rare that you are the first to have the problem, someone out there has posted the fly issue and likely the solution.\nSearch using: <technology> <problem statement>. Example: pgcli error column c.relhasoids does not exist.\nThere are often different solutions for the same problem due to variation in environments.\nCheck the tech\u2019s documentation. Use its search if available or use the browsers search function.\nTry uninstall (this may remove the bad actor) and reinstall of application or reimplementation of action. Remember to restart the server/pc for reinstalls.\nSometimes reinstalling fails to resolve the issue but works if you uninstall first.\nPost your question to Stackoverflow. Read the Stackoverflow guide on posting good questions.\nhttps://stackoverflow.com/help/how-to-ask\nThis will be your real life. Ask an expert in the future (in addition to coworkers).\nAsk in Slack\nBefore asking a question,\nCheck Pins (where the shortcut to the repo and this FAQ is located)\nUse the slack app\u2019s search function\nUse the bot @ZoomcampQABot to do the search for you\ncheck the FAQ (this document), use search [ctrl+f]\nWhen asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? 
Does it have a line number to the \u201coffending\u201d code and have you check it for typos?\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.\nDO NOT use screenshots, especially don\u2019t take pictures from a phone.\nDO NOT tag instructors, it may discourage others from helping you. Copy and paste errors; if it\u2019s long, just post it in a reply to your thread.\nUse ``` for formatting your code.\nUse the same thread for the conversation (that means reply to your own thread).\nDO NOT create multiple posts to discuss the issue.\nlearYou may create a new post if the issue reemerges down the road. Describe what has changed in the environment.\nProvide additional information in the same thread of the steps you have taken for resolution.\nTake a break and come back later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, workout, play a video game, watch a tv show, whatever allows your brain to not think about it for a little while or even until the next day.\nRemember technology issues in real life sometimes take days or even weeks to resolve.\nIf somebody helped you with your problem and it's not in the FAQ, please add it there. It will help other students.",
"section": "General course-related questions",
"question": "How to troubleshoot issues"
},
{
"text": "When the troubleshooting guide above does not help resolve it and you need another pair of eyeballs to spot mistakes. When asking a question, include as much information as possible:\nWhat are you coding on? What OS?\nWhat command did you run, which video did you follow? Etc etc\nWhat error did you get? Does it have a line number to the \u201coffending\u201d code and have you check it for typos?\nWhat have you tried that did not work? This answer is crucial as without it, helpers would ask you to do the suggestions in the error log first. Or just read this FAQ document.",
"section": "General course-related questions",
"question": "How to ask questions"
},
{
"text": "After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub\nHaving this local repository on your computer will make it easy for you to access the instructors\u2019 code and make pull requests (if you want to add your own notes or make changes to the course content).\nYou will probably also create your own repositories that host your notes, versions of your file, to do this. Here is a great tutorial that shows you how to do this: https://www.atlassian.com/git/tutorials/setting-up-a-repository\nRemember to ignore large database, .csv, and .gz files, and other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore NEVER store passwords or keys in a git repo (even if that repo is set to private).\nThis is also a great resource: https://dangitgit.com/",
"section": "General course-related questions",
"question": "How do I use Git / GitHub for this course?"
},
{
"text": "Error: Makefile:2: *** missing separator. Stop.\nSolution: Tabs in document should be converted to Tab instead of spaces. Follow this stack.",
"section": "General course-related questions",
"question": "VS Code: Tab using spaces"
},
{
"text": "If you\u2019re running Linux on Windows Subsystem for Linux (WSL) 2, you can open HTML files from the guest (Linux) with whatever Internet Browser you have installed on the host (Windows). Just install wslu and open the page with wslview <file>, for example:\nwslview index.html\nYou can customise which browser to use by setting the BROWSER environment variable first. For example:\nexport BROWSER='/mnt/c/Program Files/Firefox/firefox.exe'",
"section": "General course-related questions",
"question": "Opening an HTML file with a Windows browser from Linux running on WSL"
},
{
"text": "This tutorial shows you how to set up the Chrome Remote Desktop service on a Debian Linux virtual machine (VM) instance on Compute Engine. Chrome Remote Desktop allows you to remotely access applications with a graphical user interface.\nTaxi Data - Yellow Taxi Trip Records downloading error, Error no or XML error webpage\nWhen you try to download the 2021 data from TLC website, you get this error:\nIf you click on the link, and ERROR 403: Forbidden on the terminal.\nWe have a backup, so use it instead: https://github.com/DataTalksClub/nyc-tlc-data\nSo the link should be https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz\nNote: Make sure to unzip the \u201cgz\u201d file (no, the \u201cunzip\u201d command won\u2019t work for this.)\n\u201cgzip -d file.gz\u201dg",
"section": "Module 1: Docker and Terraform",
"question": "Set up Chrome Remote Desktop for Linux on Compute Engine"
},
{
"text": "In this video, we store the data file as \u201coutput.csv\u201d. The data file won\u2019t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = \u201coutput.cs -v\u201d with the file name given at the end of the URL. Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the highlighted part is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = \u201coutput.csv\u201d with\ncsv_name = url.split(\u201c/\u201d)[-1] . Then when we use csv_name to using pd.read_csv, there won\u2019t be an issue even though the file name really has the extension csv.gz instead of csv since the pandas read_csv function can read csv.gz files directly.",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - How to handle taxi data files, now that the files are available as *.csv.gz?"
},
{
"text": "Yellow Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf\nGreen Trips: https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - Data Dictionary for NY Taxi data?"
},
{
"text": "You can unzip this downloaded parquet file, in the command line. The result is a csv file which can be imported with pandas using the pd.read_csv() shown in the videos.\n\u2018\u2019\u2019gunzip green_tripdata_2019-09.csv.gz\u2019\u2019\u2019\nSOLUTION TO USING PARQUET FILES DIRECTLY IN PYTHON SCRIPT ingest_data.py\nIn the def main(params) add this line\nparquet_name= 'output.parquet'\nThen edit the code which downloads the files\nos.system(f\"wget {url} -O {parquet_name}\")\nConvert the download .parquet file to csv and rename as csv_name to keep it relevant to the rest of the code\ndf = pd.read_parquet(parquet_name)\ndf.to_csv(csv_name, index=False)",
"section": "Module 1: Docker and Terraform",
"question": "Taxi Data - Unzip Parquet file"
},
{
"text": "\u201cwget is not recognized as an internal or external command\u201d, you need to install it.\nOn Ubuntu, run:\n$ sudo apt-get install wget\nOn MacOS, the easiest way to install wget is to use Brew:\n$ brew install wget\nOn Windows, the easiest way to install wget is to use Chocolatey:\n$ choco install wget\nOr you can download a binary (https://gnuwin32.sourceforge.net/packages/wget.htm) and put it to any location in your PATH (e.g. C:/tools/)\nAlso, you can following this step to install Wget on MS Windows\n* Download the latest wget binary for windows from [eternallybored] (https://eternallybored.org/misc/wget/) (they are available as a zip with documentation, or just an exe)\n* If you downloaded the zip, extract all (if windows built in zip utility gives an error, use [7-zip] (https://7-zip.org/)).\n* Rename the file `wget64.exe` to `wget.exe` if necessary.\n* Move wget.exe to your `Git\\mingw64\\bin\\`.\nAlternatively, you can use a Python wget library, but instead of simply using \u201cwget\u201d you\u2019ll need to use\npython -m wget\nYou need to install it with pip first:\npip install wget\nAlternatively, you can just paste the file URL into your web browser and download the file normally that way. You\u2019ll want to move the resulting file into your working directory.\nAlso recommended a look at the python library requests for the loading gz file https://pypi.org/project/requests",
"section": "Module 1: Docker and Terraform",
"question": "lwget is not recognized as an internal or external command"
},
{
"text": "Firstly, make sure that you add \u201c!\u201d before wget if you\u2019re running your command in a Jupyter Notebook or CLI. Then, you can check one of this 2 things (from CLI):\nUsing the Python library wget you installed with pip, try python -m wget <url>\nWrite the usual command and add --no-check-certificate at the end. So it should be:\n!wget <website_url> --no-check-certificate",
"section": "Module 1: Docker and Terraform",
"question": "wget - ERROR: cannot verify <website> certificate (MacOS)"
},
{
"text": "For those who wish to use the backslash as an escape character in Git Bash for Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\\ (no need to include in .bashrc)",
"section": "Module 1: Docker and Terraform",
"question": "Git Bash - Backslash as an escape character in Git Bash for Windows"
},
{
"text": "Instruction on how to store secrets that will be avialable in GitHub Codespaces.\nManaging your account-specific secrets for GitHub Codespaces - GitHub Docs",
"section": "Module 1: Docker and Terraform",
"question": "GitHub Codespaces - How to store secrets"
},
{
"text": "Make sure you're able to start the Docker daemon, and check the issue immediately down below:\nAnd don\u2019t forget to update the wsl in powershell the command is wsl \u2013update",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot connect to Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
},
{
"text": "As the official Docker for Windows documentation says, the Docker engine can either use the\nHyper-V or WSL2 as its backend. However, a few constraints might apply\nWindows 10 Pro / 11 Pro Users: \nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nWindows 10 Home / 11 Home Users: \nOn the other hand, Users of the 'Home' version do NOT have the option Hyper-V option enabled, which means, you can only get Docker up and running using the WSL2 credentials(Windows Subsystem for Linux). Url\nYou can find the detailed instructions to do so here: rt ghttps://pureinfotech.com/install-wsl-windows-11/\nIn case, you run into another issue while trying to install WSL2 (WslRegisterDistribution failed with error: 0x800701bc), Make sure you update the WSL2 Linux Kernel, following the guidelines here: \n\nhttps://github.com/microsoft/WSL/issues/5393",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post: \"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/create\" : open //./pipe/docker_engine: The system cannot find the file specified"
},
{
"text": "Whenever a `docker pull is performed (either manually or by `docker-compose up`), it attempts to fetch the given image name (pgadmin4, for the example above) from a repository (dbpage).\nIF the repository is public, the fetch and download happens without any issue whatsoever.\nFor instance:\ndocker pull postgres:13\ndocker pull dpage/pgadmin4\nBE ADVISED:\n\nThe Docker Images we'll be using throughout the Data Engineering Zoomcamp are all public (except when or if explicitly said otherwise by the instructors or co-instructors).\n\nMeaning: you are NOT required to perform a docker login to fetch them. \n\nSo if you get the message above saying \"docker login': denied: requested access to the resource is denied. That is most likely due to a typo in your image name:\n\nFor instance:\n$ docker pull dbpage/pgadmin4\nWill throw that exception telling you \"repository does not exist or may require 'docker login'\nError response from daemon: pull access denied for dbpage/pgadmin4, repository does not exist or \nmay require 'docker login': denied: requested access to the resource is denied\nBut that actually happened because the actual image is dpage/pgadmin4 and NOT dbpage/pgadmin4\nHow to fix it:\n$ docker pull dpage/pgadmin4\nEXTRA NOTES:\nIn the real world, occasionally, when you're working for a company or closed organisation, the Docker image you're trying to fetch might be under a private repo that your DockerHub Username was granted access to.\nFor which cases, you must first execute:\n$ docker login\nFill in the details of your username and password.\nAnd only then perform the `docker pull` against that private repository\nWhy am I encountering a \"permission denied\" error when creating a PostgreSQL Docker container for the New York Taxi Database with a mounted volume on macOS M1?\nIssue Description:\nWhen attempting to run a Docker command similar to the one below:\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e 
POSTGRES_DB=\"ny_taxi\" \\\n-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\mount\npostgres:13\nYou encounter the error message:\ndocker: Error response from daemon: error while creating mount source path '/path/to/ny_taxi_postgres_data': chown /path/to/ny_taxi_postgres_data: permission denied.\nSolution:\n1- Stop Rancher Desktop:\nIf you are using Rancher Desktop and face this issue, stop Rancher Desktop to resolve compatibility problems.\n2- Install Docker Desktop:\nInstall Docker Desktop, ensuring that it is properly configured and has the required permissions.\n2-Retry Docker Command:\nRun the Docker command again after switching to Docker Desktop. This step resolves compatibility issues on some systems.\nNote: The issue occurred because Rancher Desktop was in use. Switching to Docker Desktop resolves compatibility problems and allows for the successful creation of PostgreSQL containers with mounted volumes for the New York Taxi Database on macOS M1.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - docker pull dbpage"
},
{
"text": "When I runned command to create postgre in docker container it created folder on my local machine to mount it to volume inside container. It has write and read protection and owned by user 999, so I could not delete it by simply drag to trash. My obsidian could not started due to access error, so I had to change placement of this folder and delete old folder by this command:\nsudo rm -r -f docker_test/\n- where `rm` - remove, `-r` - recursively, `-f` - force, `docker_test/` - folder.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - can\u2019t delete local folder that mounted to docker volume"
},
{
"text": "First off, make sure you're running the latest version of Docker for Windows, which you can download from here. Sometimes using the menu to \"Upgrade\" doesn't work (which is another clear indicator for you to uninstall, and reinstall with the latest version)\nIf docker is stuck on starting, first try to switch containers by right clicking the docker symbol from the running programs and switch the containers from windows to linux or vice versa\n[Windows 10 / 11 Pro Edition] The Pro Edition of Windows can run Docker either by using Hyper-V or WSL2 as its backend (Docker Engine)\nIn order to use Hyper-V as its back-end, you MUST have it enabled first, which you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11\nIf you opt-in for WSL2, you can follow the same steps as detailed in the tutorial here",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Docker won't start or is stuck in settings (Windows 10 / 11)"
},
{
"text": "It is recommended by the Docker do\n[Windows 10 / 11 Home Edition] If you're running a Home Edition, you can still make it work with WSL2 (Windows Subsystem for Linux) by following the tutorial here\nIf even after making sure your WSL2 (or Hyper-V) is set up accordingly, Docker remains stuck, you can try the option to Reset to Factory Defaults or do a fresh install.",
"section": "Module 1: Docker and Terraform",
"question": "Should I run docker commands from the windows file system or a file system of a Linux distribution in WSL?"
},
{
"text": "More info in the Docker Docs on Best Practises",
"section": "Module 1: Docker and Terraform",
"question": "Docker - cs to store all code in your default Linux distro to get the best out of file system performance (since Docker runs on WSL2 backend by default for Windows 10 Home / Windows 11 Home users)."
},
{
"text": "You may have this error:\n$ docker run -it ubuntu bash\nthe input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'\nerror:\nSolution:\nUse winpty before docker command (source)\n$ winpty docker run -it ubuntu bash\nYou also can make an alias:\necho \"alias docker='winpty docker'\" >> ~/.bashrc\nOR\necho \"alias docker='winpty docker'\" >> ~/.bash_profile",
"section": "Module 1: Docker and Terraform",
"question": "Docker - The input device is not a TTY (Docker run for Windows)"
},
{
"text": "You may have this error:\nRetrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.u\nrllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':\n/simple/pandas/\nPossible solution might be:\n$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot pip install on Docker container (Windows)"
},
{
"text": "Even after properly running the docker script the folder is empty in the vs code then try this (For Windows)\nwinpty docker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"C:\\Users\\abhin\\dataengg\\DE_Project_git_connected\\DE_OLD\\week1_set_up\\docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nHere quoting the absolute path in the -v parameter is solving the issue and all the files are visible in the Vs-code ny_taxi folder as shown in the video",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ny_taxi_postgres_data is empty"
},
{
"text": "Check this article for details - Setting up docker in macOS\nFrom researching it seems this method might be out of date, it seems that since docker changed their licensing model, the above is a bit hit and miss. What worked for me was to just go to the docker website and download their dmg. Haven\u2019t had an issue with that method.",
"section": "Module 1: Docker and Terraform",
"question": "dasDocker - Setting up Docker on Mac"
},
{
"text": "$ docker run -it\\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"admin\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"/mnt/path/to/ny_taxi_postgres_data\":\"/var/lib/postgresql/data\" \\\n-p 5432:5432 \\\npostgres:13\nCCW\nThe files belonging to this database system will be owned by user \"postgres\".\nThis use The database cluster will be initialized with locale \"en_US.utf8\".\nThe default databerrorase encoding has accordingly been set to \"UTF8\".\nxt search configuration will be set to \"english\".\nData page checksums are disabled.\nfixing permissions on existing directory /var/lib/postgresql/data ... initdb: f\nerror: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted volume\nOne way to solve this issue is to create a local docker volume and map it to postgres data directory /var/lib/postgresql/data\nThe input dtc_postgres_volume_local must match in both commands below\n$ docker volume create --name dtc_postgres_volume_local -d local\n$ docker run -it\\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v dtc_postgres_volume_local:/var/lib/postgresql/data \\\n-p 5432:5432\\\npostgres:13\nTo verify the above command works in (WSL2 Ubuntu 22.04, verified 2024-Jan), go to the Docker Desktop app and look under Volumes - dtc_postgres_volume_local would be listed there. The folder ny_taxi_postgres_data would however be empty, since we used an alternative config.\nAn alternate error could be:\ninitdb: error: directory \"/var/lib/postgresql/data\" exists but is not empty\nIf you want to create a new database system, either remove or empthe directory \"/var/lib/postgresql/data\" or run initdb\nwitls",
"section": "Module 1: Docker and Terraform",
"question": "1Docker - Could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted"
},
{
"text": "Mapping volumes on Windows could be tricky. The way it was done in the course video doesn\u2019t work for everyone.\nFirst, if yo\nmove your data to some folder without spaces. E.g. if your code is in \u201cC:/Users/Alexey Grigorev/git/\u2026\u201d, move it to \u201cC:/git/\u2026\u201d\nTry replacing the \u201c-v\u201d part with one of the following options:\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n--volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data\nwinpty docker run -it\n-e POSTGRES_USER=\"root\"\n-e POSTGRES_PASSWORD=\"root\"\n-e POSTGRES_DB=\"ny_taxi\"\n-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n-p 5432:5432\npostgres:1\nTry adding winpty before the whole command\n3\nwin\nTry adding quotes:\n-v \"/c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \u201c/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"//c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\"\n-v \"c:\\some\\path\\ny_taxi_postgres_data\":/var/lib/postgresql/data\nNote: (Window) if it automatically creates a folder called \u201cny_taxi_postgres_data;C\u201d suggests you have problems with volume mapping, try deleting both folders and replacing \u201c-v\u201d part with other options. For me \u201c//c/\u201d works instead of \u201c/c/\u201d. 
And it will work by automatically creating a correct folder called \u201cny_taxi_postgres_data\u201d.\nA possible solution to this error would be to use /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data (with quotes\u2019 position varying as in the above list).\nYes for windows use the command it works perfectly fine\n-v /\u201d$(pwd)\u201d/ny_taxi_postgres_data:/var/lib/postgresql/data\nImportant: note how the quotes are placed.\nIf none of these options work, you can use a volume name instead of the path:\n-v ny_taxi_postgres_data:/var/lib/postgresql/data\nFor Mac: You can wrap $(pwd) with quotes like the highlighted.\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\nPostgres:13\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v \"$(pwd)\"/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSource:https://stackoverflow.com/questions/48522615/docker-error-invalid-reference-format-repository-name-must-be-lowercase",
"section": "Module 1: Docker and Terraform",
"question": "Docker - invalid reference format: repository name must be lowercase (Mounting volumes with Docker on Windows)"
},
{
"text": "Change the mounting path. Replace it with one of following:\n-v /e/zoomcamp/...:/var/lib/postgresql/data\n-v /c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data\\ (leading slash in front of c:)",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: invalid mode: \\Program Files\\Git\\var\\lib\\postgresql\\data."
},
{
"text": "When you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-v <your path>:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nThe error message above could happen. That means you should not mount on the second run. This command helped me:\nWhen you run this command second time\ndocker run -it \\\n-e POSTGRES_USER=\"root\" \\\n-e POSTGRES_PASSWORD=\"root\" \\\n-e POSTGRES_DB=\"ny_taxi\" \\\n-p 5432:5432 \\\npostgres:13",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: error while creating buildmount source path '/run/desktop/mnt/host/c/<your path>': mkdir /run/desktop/mnt/host/c: file exists"
},
{
"text": "This error appeared when running the command: docker build -t taxi_ingest:v001 .\nWhen feeding the database with the data the user id of the directory ny_taxi_postgres_data was changed to 999, so my user couldn\u2019t access it when running the above command. Even though this is not the problem here it helped to raise the error due to the permission issue.\nSince at this point we only need the files Dockerfile and ingest_data.py, to fix this error one can run the docker build command on a different directory (having only these two files).\nA more complete explanation can be found here: https://stackoverflow.com/questions/41286028/docker-build-error-checking-context-cant-stat-c-users-username-appdata\nYou can fix the problem by changing the permission of the directory on ubuntu with following command:\nsudo chown -R $USER dir_path\nOn windows follow the link: https://thegeekpage.com/take-ownership-of-a-file-folder-through-command-prompt-in-windows-10/ \n\n\t\t\t\t\t\t\t\t\t\t\tAdded by\n\t\t\t\t\t\t\t\t\t\t\tKenan Arslanbay",
"section": "Module 1: Docker and Terraform",
"question": "Docker - build error: error checking context: 'can't stat '/home/user/repos/data-engineering/week_1_basics_n_setup/2_docker_sql/ny_taxi_postgres_data''."
},
{
"text": "You might have installed docker via snap. Run \u201csudo snap status docker\u201d to verify.\nIf you have \u201cerror: unknown command \"status\", see 'snap help'.\u201d as a response than deinstall docker and install via the official website\nBind for 0.0.0.0:5432 failed: port is a",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ERRO[0000] error waiting for container: context canceled"
},
{
"text": "Found the issue in the PopOS linux. It happened because our user didn\u2019t have authorization rights to the host folder ( which also caused folder seems empty, but it didn\u2019t!).\n\u2705Solution:\nJust add permission for everyone to the corresponding folder\nsudo chmod -R 777 <path_to_folder>\nExample:\nsudo chmod -R 777 ny_taxi_postgres_data/",
"section": "Module 1: Docker and Terraform",
"question": "Docker - build error checking context: can\u2019t stat \u2018/home/fhrzn/Projects/\u2026./ny_taxi_postgres_data\u2019"
},
{
"text": "This happens on Ubuntu/Linux systems when trying to run the command to build the Docker container again.\n$ docker build -t taxi_ingest:v001 .\nA folder is created to host the Docker files. When the build command is executed again to rebuild the pipeline or create a new one the error is raised as there are no permissions on this new folder. Grant permissions by running this comtionmand;\n$ sudo chmod -R 755 ny_taxi_postgres_data\nOr use 777 if you still see problems. 755 grants write access to only the owner.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - failed to solve with frontend dockerfile.v0: failed to read dockerfile: error from sender: open ny_taxi_postgres_data: permission denied."
},
{
"text": "Get the network name via: $ docker network ls.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Docker network name"
},
{
"text": "Sometimes, when you try to restart a docker image configured with a network name, the above message appears. In this case, use the following command with the appropriate container name:\n>>> If the container is running state, use docker stop <container_name>\n>>> then, docker rm pg-database\nOr use docker start instead of docker run in order to restart the docker image without removing it.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Error response from daemon: Conflict. The container name \"pg-database\" is already in use by container \u201cxxx\u201d. You have to remove (or rename) that container to be able to reuse that name."
},
{
"text": "Typical error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name \"pgdatabase\" to address: Name or service not known\nWhen running docker-compose up -d see which network is created and use this for the ingestions script instead of pg-network and see the name of the database to use instead of pgdatabase\nE.g.:\npg-network becomes 2docker_default\nPgdatabase becomes 2docker-pgdatabase-1",
"section": "Module 1: Docker and Terraform",
"question": "Docker - ingestion when using docker-compose could not translate host name"
},
{
"text": "terraformRun this command before starting your VM:\nOn Intel CPU:\nmodprobe -r kvm_intel\nmodprobe kvm_intel nested=1\nOn AMD CPU:\nmodprobe -r kvm_amd\nmodprobe kvm_amd nested=1",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Cannot install docker on MacOS/Windows 11 VM running on top of Linux (due to Nested virtualization)."
},
{
"text": "It\u2019s very easy to manage your docker container, images, network and compose projects from VS Code.\nJust install the official extension and launch it from the left side icon.\nIt will work even if your Docker runs on WSL2, as VS Code can easily connect with your Linux.\nDocker - How to stop a container?\nUse the following command:\n$ docker stop <container_id>",
"section": "Module 1: Docker and Terraform",
"question": "Docker - Connecting from VS Code"
},
{
"text": "When you see this in logs, your container with postgres is not accepting any requests, so if you attempt to connect, you'll get this error:\nconnection failed: server closed the connection unexpectedly\nThis probably means the server terminated abnormally before or while processing the request.\nIn this case, you need to delete the directory with data (the one you map to the container with the -v flag) and restart the container.",
"section": "Module 1: Docker and Terraform",
"question": "Docker - PostgreSQL Database directory appears to contain a database. Database system is shut down"
},
{
"text": "On few versions of Ubuntu, snap command can be used to install Docker.\nsudo snap install docker",
"section": "Module 1: Docker and Terraform",
"question": "Docker not installable on Ubuntu"
},
{
"text": "error: could not change permissions of directory \"/var/lib/postgresql/data\": Operation not permitted volume\nif you have used the prev answer (just before this) and have created a local docker volume, then you need to tell the compose file about the named volume:\nvolumes:\ndtc_postgres_volume_local: # Define the named volume here\n# services mentioned in the compose file auto become part of the same network!\nservices:\nyour remaining code here . . .\nnow use docker volume inspect dtc_postgres_volume_local to see the location by checking the value of Mountpoint\nIn my case, after i ran docker compose up the mounting dir created was named \u2018docker_sql_dtc_postgres_volume_local\u2019 whereas it should have used the already existing \u2018dtc_postgres_volume_local\u2019\nAll i did to fix this is that I renamed the existing \u2018dtc_postgres_volume_local\u2019 to \u2018docker_sql_dtc_postgres_volume_local\u2019 and removed the newly created one (just be careful when doing this)\nrun docker compose up again and check if the table is there or not!",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - mounting error"
},
{
"text": "Couldn\u2019t translate host name to address\nMake sure postgres database is running.\n\n\u200b\u200bUse the command to start containers in detached mode: docker-compose up -d\n(data-engineering-zoomcamp) hw % docker compose up -d\n[+] Running 2/2\n\u283f Container pg-admin Started 0.6s\n\u283f Container pg-database Started\nTo view the containers use: docker ps.\n(data-engineering-zoomcamp) hw % docker ps\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\nfaf05090972e postgres:13 \"docker-entrypoint.s\u2026\" 39 seconds ago Up 37 seconds 0.0.0.0:5432->5432/tcp pg-database\n6344dcecd58f dpage/pgadmin4 \"/entrypoint.sh\" 39 seconds ago Up 37 seconds 443/tcp, 0.0.0.0:8080->80/tcp pg-admin\nhw\nTo view logs for a container: docker logs <containerid>\n(data-engineering-zoomcamp) hw % docker logs faf05090972e\nPostgreSQL Database directory appears to contain a database; Skipping initialization\n2022-01-25 05:58:45.948 UTC [1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv4 address \"0.0.0.0\", port 5432\n2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv6 address \"::\", port 5432\n2022-01-25 05:58:45.954 UTC [1] LOG: listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\n2022-01-25 05:58:45.984 UTC [28] LOG: database system was interrupted; last known up at 2022-01-24 17:48:35 UTC\n2022-01-25 05:58:48.581 UTC [28] LOG: database system was not properly shut down; automatic recovery in\nprogress\n2022-01-25 05:58:48.602 UTC [28] LOG: redo starts at 0/872A5910\n2022-01-25 05:59:33.726 UTC [28] LOG: invalid record length at 0/98A3C160: wanted 24, got 0\n2022-01-25 05:59:33.726 UTC [28\n] LOG: redo done at 0/98A3C128\n2022-01-25 05:59:48.051 UTC [1] LOG: database system is ready to accept connections\nIf docker ps doesn\u2019t show pgdatabase running, run: docker ps -a\nThis should show all 
containers, either running or stopped.\nGet the container id for pgdatabase-1, and run",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error translating host name to address"
},
{
"text": "After executing `docker-compose up` - if you lose database data and are unable to successfully execute your Ingestion script (to re-populate your database) but receive the following error:\nsqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name /data_pgadmin:/var/lib/pgadmin\"pg-database\" to address: Name or service not known\nDocker compose is creating its own default network since it is no longer specified in a docker execution command or file. Docker Compose will emit to logs the new network name. See the logs after executing `docker compose up` to find the network name and change the network name argument in your Ingestion script.\nIf problems persist with pgcli, we can use HeidiSQL,usql\nKrishna Anand",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Data retention (could not translate host name \"pg-database\" to address: Name or service not known)"
},
{
"text": "It returns --> Error response from daemon: network 66ae65944d643fdebbc89bd0329f1409dec2c9e12248052f5f4c4be7d1bdc6a3 not found\nTry:\ndocker ps -a to see all the stopped & running containers\nd to nuke all the containers\nTry: docker-compose up -d again ports\nOn localhost:8080 server \u2192 Unable to connect to server: could not translate host name 'pg-database' to address: Name does not resolve\nTry: new host name, best without \u201c - \u201d e.g. pgdatabase\nAnd on docker-compose.yml, should specify docker network & specify the same network in both containers\nservices:\npgdatabase:\nimage: postgres:13\nenvironment:\n- POSTGRES_USER=root\n- POSTGRES_PASSWORD=root\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"./ny_taxi_postgres_data:/var/lib/postgresql/data:rw\"\nports:\n- \"5431:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Hostname does not resolve"
},
{
"text": "So one common issue is when you run docker-compose on GCP, postgres won\u2019t persist it\u2019s data to mentioned path for example:\nservices:\n\u2026\n\u2026\npgadmin:\n\u2026\n\u2026\nVolumes:\n\u201c./pgadmin\u201d:/var/lib/pgadmin:wr\u201d\nMight not work so in this use you can use Docker Volume to make it persist, by simply changing\nservices:\n\u2026\n\u2026.\npgadmin:\n\u2026\n\u2026\nVolumes:\npgadmin:/var/lib/pgadmin\nvolumes:\nPgadmin:",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Persist PGAdmin docker contents on GCP"
},
{
"text": "The docker will keep on crashing continuously\nNot working after restart\ndocker engine stopped\nAnd failed to fetch extensions pop ups will on screen non-stop\nSolution :\nTry checking if latest version of docker is installed / Try updating the docker\nIf Problem still persist then final solution is to reinstall docker\n(Just have to fetch images again else no issues)",
"section": "Module 1: Docker and Terraform",
"question": "Docker engine stopped_failed to fetch extensions"
},
{
"text": "As per the lessons,\nPersisting pgAdmin configuration (i.e. server name) is done by adding a \u201cvolumes\u201d section:\nservices:\npgdatabase:\n[...]\npgadmin:\nimage: dpage/pgadmin4\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=root\nvolumes:\n- \"./pgAdmin_data:/var/lib/pgadmin/sessions:rw\"\nports:\n- \"8080:80\"\nIn the example above, \u201dpgAdmin_data\u201d is a folder on the host machine, and \u201c/var/lib/pgadmin/sessions\u201d is the session settings folder in the pgAdmin container.\nBefore running docker-compose up on the YAML file, we also need to give the pgAdmin container access to write to the \u201cpgAdmin_data\u201d folder. The container runs with a username called \u201c5050\u201d and user group \u201c5050\u201d. The bash command to give access over the mounted volume is:\nsudo chown -R 5050:5050 pgAdmin_data",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Persist PGAdmin configuration"
},
{
"text": "This happens if you did not create the docker group and added your user. Follow these steps from the link:\nguides/docker-without-sudo.md at main \u00b7 sindresorhus/guides \u00b7 GitHub\nAnd then press ctrl+D to log-out and log-in again. pgAdmin: Maintain state so that it remembers your previous connection\nIf you are tired of having to setup your database connection each time that you fire up the containers, all you have to do is create a volume for pgAdmin:\nIn your docker-compose.yaml file, enter the following into your pgAdmin declaration:\nvolumes:\n- type: volume\nsource: pgadmin_data\ntarget: /var/lib/pgadmin\nAlso add the following to the end of the file:ls\nvolumes:\nPgadmin_data:",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - dial unix /var/run/docker.sock: connect: permission denied"
},
{
"text": "This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose.",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - docker-compose still not available after changing .bashrc"
},
{
"text": "Installing pass via \u2018sudo apt install pass\u2019 helped to solve the issue. More about this can be found here: https://github.com/moby/buildkit/issues/1078",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error getting credentials after running docker-compose up -d"
},
{
"text": "For everyone who's having problem with Docker compose, getting the data in postgres and similar issues, please take care of the following:\ncreate a new volume on docker (either using the command line or docker desktop app)\nmake the following changes to your docker-compose.yml file (see attachment)\nset low_memory=false when importing the csv file (df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=1000, low_memory=False))\nuse the below function (in the upload-data.ipynb) for better tracking of your ingestion process (see attachment)\nOrder of execution:\n(1) open terminal in 2_docker_sql folder and run docker compose up\n(2) ensure no other containers are running except the one you just executed (pgadmin and pgdatabase)\n(3) open jupyter notebook and begin the data ingestion\n(4) open pgadmin and set up a server (make sure you use the same configurations as your docker-compose.yml file like the same name (pgdatabase), port, databasename (ny_taxi) etc.",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Errors pertaining to docker-compose.yml and pgadmin setup"
},
{
"text": "Locate config.json file for docker (check your home directory; Users/username/.docker).\nModify credsStore to credStore\nSave and re-run",
"section": "Module 1: Docker and Terraform",
"question": "Docker Compose up -d error getting credentials - err: exec: \"docker-credential-desktop\": executable file not found in %PATH%, out: ``"
},
{
"text": "To figure out which docker-compose you need to download from https://github.com/docker/compose/releases you can check your system with these commands:\nuname -s -> return Linux most likely\nuname -m -> return \"flavor\"\nOr try this command -\nsudo curl -L \"https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)\" -o /usr/local/bin/docker-compose",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Which docker-compose binary to use for WSL?"
},
{
"text": "If you wrote the docker-compose.yaml file exactly like the video, you might run into an error like this:dev\nservice \"pgdatabase\" refers to undefined volume dtc_postgres_volume_local: invalid compose project\nIn order to make it work, you need to include the volume in your docker-compose file. Just add the following:\nvolumes:\ndtc_postgres_volume_local:\n(Make sure volumes are at the same level as services.)",
"section": "Module 1: Docker and Terraform",
"question": "Docker-Compose - Error undefined volume in Windows/WSL"
},
{
"text": "Error: initdb: error: could not change permissions of directory\nIssue: WSL and Windows do not manage permissions in the same way causing conflict if using the Windows file system rather than the WSL file system.\nSolution: Use Docker volumes.\nWhy: Volume is used for storage of persistent data and not for use of transferring files. A local volume is unnecessary.\nBenefit: This resolves permission issues and allows for better management of volumes.\nNOTE: the \u2018user:\u2019 is not necessary if using docker volumes, but is if using local drive.\n</> docker-compose.yaml\nservices:\npostgres:\nimage: postgres:15-alpine\ncontainer_name: postgres\nuser: \"0:0\"\nenvironment:\n- POSTGRES_USER=postgres\n- POSTGRES_PASSWORD=postgres\n- POSTGRES_DB=ny_taxi\nvolumes:\n- \"pg-data:/var/lib/postgresql/data\"\nports:\n- \"5432:5432\"\nnetworks:\n- pg-network\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin\nuser: \"${UID}:${GID}\"\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=email@some-site.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\nvolumes:\n- \"pg-admin:/var/lib/pgadmin\"\nports:\n- \"8080:80\"\nnetworks:\n- pg-network\nnetworks:\npg-network:\nname: pg-network\nvolumes:\npg-data:\nname: ingest_pgdata\npg-admin:\nname: ingest_pgadmin",
"section": "Module 1: Docker and Terraform",
"question": "WSL Docker directory permissions error"
},
{
"text": "Cause : If Running on git bash or vm in windows pgadmin doesnt work easily LIbraries like psycopg2 and libpq ar required still the error persists.\nSolution- I use psql instead of pgadmin totally same\nPip install psycopg2\ndock",
"section": "Module 1: Docker and Terraform",
"question": "Docker - If pgadmin is not working for Querying in Postgres Use PSQL"
},
{
"text": "Cause:\nIt happens because the apps are not updated. To be specific, search for any pending updates for Windows Terminal, WSL and Windows Security updates.\nSolution\nfor updating Windows terminal which worked for me:\nGo to Microsoft Store.\nGo to the library of apps installed in your system.\nSearch for Windows terminal.\nUpdate the app and restart your system to see the changes.\nFor updating the Windows security updates:\nGo to Windows updates and check if there are any pending updates from Windows, especially security updates.\nDo restart your system once the updates are downloaded and installed successfully.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Insufficient system resources exist to complete the requested service."
},
{
"text": "Up restardoting the same issue appears. Happens out of the blue on windows.\nSolution 1: Fixing DNS Issue (credit: reddit) this worked for me personally\nreg add \"HKLM\\System\\CurrentControlSet\\Services\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"4\" /f\nRestart your computer and then enable it with the following\nreg add \"HKLM\\System\\CurrentControlSet\\Services\\Dnscache\" /v \"Start\" /t REG_DWORD /d \"2\" /f\nRestart your OS again. It should work.\nSolution 2: right click on running Docker icon (next to clock) and chose \"Switch to Linux containers\"\nbash: conda: command not found\nDatabase is uninitialized and superuser password is not specified.\nDatabase is uninitialized and superuser password is not specified.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - WSL integration with distro Ubuntu unexpectedly stopped with exit code 1."
},
{
"text": "Issue when trying to run the GPC VM through SSH through WSL2, probably because WSL2 isn\u2019t looking for .ssh keys in the correct folder. My case I was trying to run this command in the terminal and getting an error\nPC:/mnt/c/Users/User/.ssh$ ssh -i gpc [username]@[my external IP]\nYou can try to use sudo before the command\nSudo .ssh$ ssh -i gpc [username]@[my external IP]\nYou can also try to cd to your folder and change the permissions for the private key SSH file.\nchmod 600 gpc\nIf that doesn\u2019t work, create a .ssh folder in the home diretory of WSL2 and copy the content of windows .ssh folder to that new folder.\ncd ~\nmkdir .ssh\ncp -r /mnt/c/Users/YourUsername/.ssh/* ~/.ssh/\nYou might need to adjust the permissions of the files and folders in the .ssh directory.",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Permissions too open at Windows"
},
{
"text": "Such as the issue above, WSL2 may not be referencing the correct .ssh/config path from Windows. You can create a config file at the home directory of WSL2.\ncd ~\nmkdir .ssh\nCreate a config file in this new .ssh/ folder referencing this folder:\nHostName [GPC VM external IP]\nUser [username]\nIdentityFile ~/.ssh/[private key]",
"section": "Module 1: Docker and Terraform",
"question": "WSL - Could not resolve host name"
},
{
"text": "Change TO Socket\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi\npgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - connection failed: :1), port 5432 failed: could not receive data from server: Connection refused could not send SSL negotiation packet: Connection refused"
},
{
"text": "probably some installation error, check out sy",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI --help error"
},
{
"text": "In this section of the course, the 5432 port of pgsql is mapped to your computer\u2019s 5432 port. Which means you can access the postgres database via pgcli directly from your computer.\nSo No, you don\u2019t need to run it inside another container. Your local system will do.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - INKhould we run pgcli inside another docker container?"
},
{
"text": "FATAL: password authentication failed for user \"root\"\nobservations: Below in bold do not forget the folder that was created ny_taxi_postgres_data\nThis happens if you have a local Postgres installation in your computer. To mitigate this, use a different port, like 5431, when creating the docker container, as in: -p 5431: 5432\nThen, we need to use this port when connecting to pgcli, as shown below:\npgcli -h localhost -p 5431 -u root -d ny_taxi\nThis will connect you to your postgres docker container, which is mapped to your host\u2019s 5431 port (though you might choose any port of your liking as long as it is not occupied).\nFor a more visual and detailed explanation, feel free to check the video 1.4.2 - Port Mapping and Networks in Docker\nIf you want to debug: the following can help (on a MacOS)\nTo find out if something is blocking your port (on a MacOS):\nYou can use the lsof command to find out which application is using a specific port on your local machine. `lsof -i :5432`wi\nOr list the running postgres services on your local machine with launchctl\nTo unload the running service on your local machine (on a MacOS):\nunload the launch agent for the PostgreSQL service, which will stop the service and free up the port \n`launchctl unload -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nthis one to start it again\n`launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`\nChanging port from 5432:5432 to 5431:5432 helped me to avoid this error.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - FATAL: password authentication failed for user \"root\" (You already have Postgres)"
},
{
"text": "I get this error\npgcli -h localhost -p 5432 -U root -d ny_taxi\nTraceback (most recent call last):\nFile \"/opt/anaconda3/bin/pgcli\", line 8, in <module>\nsys.exit(cli())\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1128, in __call__\nreturn self.main(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/sitYe-packages/click/core.py\", line\n1053, in main\nrv = self.invoke(ctx)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 1395, in invoke\nreturn ctx.invoke(self.callback, **ctx.params)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/click/core.py\", line 754, in invoke\nreturn __callback(*args, **kwargs)\nFile \"/opt/anaconda3/lib/python3.9/site-packages/pgcli/main.py\", line 880, in cli\nos.makedirs(config_dir)\nFile \"/opt/anaconda3/lib/python3.9/os.py\", line 225, in makedirspython\nmkdir(name, mode)PermissionError: [Errno 13] Permission denied: '/Users/vray/.config/pgcli'\nMake sure you install pgcli without sudo.\nThe recommended approach is to use conda/anaconda to make sure your system python is not affected.\nIf conda install gets stuck at \"Solving environment\" try these alternatives: https://stackoverflow.com/questions/63734508/stuck-at-solving-environment-on-anaconda",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - PermissionError: [Errno 13] Permission denied: '/some/path/.config/pgcli'"
},
{
"text": "ImportError: no pq wrapper available.\nAttempts made:\n- couldn't import \\dt\nopg 'c' implementation: No module named 'psycopg_c'\n- couldn't import psycopg 'binary' implementation: No module named 'psycopg_binary'\n- couldn't import psycopg 'python' implementation: libpq library not found\nSolution:\nFirst, make sure your Python is set to 3.9, at least.\nAnd the reason for that is we have had cases of 'psycopg2-binary' failing to install because of an old version of Python (3.7.3). \n\n0. You can check your current python version with: \n$ python -V(the V must be capital)\n1. Based on the previous output, if you've got a 3.9, skip to Step #2\n Otherwispye better off with a new environment with 3.9\n$ conda create \u2013name de-zoomcamp python=3.9\n$ conda activate de-zoomcamp\n2. Next, you should be able to install the lib for postgres like this:\n```\n$ e\n$ pip install psycopg2_binary\n```\n3. Finally, make sure you're also installing pgcli, but use conda for that:\n```\n$ pgcli -h localhost -U root -d ny_taxisudo\n```\nThere, you should be good to go now!\nAnother solution:\nRun this\npip install \"psycopg[binary,pool]\"",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - no pq wrapper available."
},
{
"text": "If your Bash prompt is stuck on the password command for postgres\nUse winpty:\nwinpty pgcli -h localhost -p 5432 -u root -d ny_taxi\nAlternatively, try using Windows terminal or terminal in VS code.\nEditPGCLI -connection failed: FATAL: password authentication failed for user \"root\"\nThe error above was faced continually despite inputting the correct password\nSolution\nOption 1: Stop the PostgreSQL service on Windows\nOption 2 (using WSL): Completely uninstall Protgres 12 from Windows and install postgresql-client on WSL (sudo apt install postgresql-client-common postgresql-client libpq-dev)\nOption 3: Change the port of the docker container\nNEW SOLUTION: 27/01/2024\nPGCLI -connection failed: FATAL: password authentication failed for user \"root\"\nIf you\u2019ve got the error above, it\u2019s probably because you were just like me, closed the connection to the Postgres:13 image in the previous step of the tutorial, which is\n\ndocker run -it \\\n-e POSTGRES_USER=root \\\n-e POSTGRES_PASSWORD=root \\\n-e POSTGRES_DB=ny_taxi \\\n-v d:/git/data-engineering-zoomcamp/week_1/docker_sql/ny_taxi_postgres_data:/var/lib/postgresql/data \\\n-p 5432:5432 \\\npostgres:13\nSo keep the database connected and you will be able to implement all the next steps of the tutorial.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - stuck on password prompt"
},
{
"text": "Problem: If you have already installed pgcli but bash doesn't recognize pgcli\nOn Git bash: bash: pgcli: command not found\nOn Windows Terminal: pgcli: The term 'pgcli' is not recognized\u2026\nSolution: Try adding a Python path C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts to Windows PATH\nFor details:\nGet the location: pip list -v\nCopy C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\site-packages\n3. Replace site-packages with Scripts: C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts\nIt can also be that you have Python installed elsewhere.\nFor me it was under c:\\python310\\lib\\site-packages\nSo I had to add c:\\python310\\lib\\Scripts to PATH, as shown below.\nPut the above path in \"Path\" (or \"PATH\") in System Variables\nReference: https://stackoverflow.com/a/68233660",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - pgcli: command not found"
},
{
"text": "In case running pgcli locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.\nBelow the usage with values used in the videos of the course for:\nnetwork name (docker network)\npostgres related variables for pgcli\nHostname\nUsername\nPort\nDatabase name\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\nPassword for root:\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\nVersion: 4.0.1\nHome: http://pgcli.com\nroot@pg-database:ny_taxi> \\dt\n+--------+------------------+-------+-------+\n| Schema | Name | Type | Owner |\n|--------+------------------+-------+-------|\n| public | yellow_taxi_data | table | root |\n+--------+------------------+-------+-------+\nSELECT 1\nTime: 0.009s\nroot@pg-database:ny_taxi>",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - running in a Docker container"
},
{
"text": "PULocationID will not be recognized but \u201cPULocationID\u201d will be. This is because unquoted \"Localidentifiers are case insensitive. See docs.",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - case sensitive use \u201cQuotations\u201d around columns with capital letters"
},
{
"text": "When using the command `\\d <database name>` you get the error column `c.relhasoids does not exist`.\nResolution:\nUninstall pgcli\nReinstall pgclidatabase \"ny_taxi\" does not exist\nRestart pc",
"section": "Module 1: Docker and Terraform",
"question": "PGCLI - error column c.relhasoids does not exist"
},
{
"text": "This happens while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThe port 5432 was taken by another postgres. We are not connecting to the port in docker, but to the port on our machine. Substitute 5431 or whatever port you mapped to for port 5432.\nAlso if this error is still persistent , kindly check if you have a service in windows running postgres , Stopping that service will resolve the issue",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: password authentication failed for user \"root\""
},
{
"text": "Can happen when connecting via pgcli\npgcli -h localhost -p 5432 -U root -d ny_taxi\nOr while uploading data via the connection in jupyter notebook\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')\nThis can happen when Postgres is already installed on your computer. Changing the port can resolve that (e.g. from 5432 to 5431).\nTo check whether there even is a root user with the ability to login:\nTry: docker exec -it <your_container_name> /bin/bash\nAnd then run\n???\nAlso, you could change port from 5432:5432 to 5431:5432\nOther solution that worked:\nChanging `POSTGRES_USER=juroot` to `PGUSER=postgres`\nBased on this: postgres with docker compose gives FATAL: role \"root\" does not exist error - Stack Overflow\nAlso `docker compose down`, removing folder that had postgres volume, running `docker compose up` again.",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: role \"root\" does not exist"
},
{
"text": "~\\anaconda3\\lib\\site-packages\\psycopg2\\__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)\n120\n121 dsn = _ext.make_dsn(dsn, **kwargs)\n--> 122 conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\n123 if cursor_factory is not None:\n124 conn.cursor_factory = cursor_factory\nOperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: database \"ny_taxi\" does not exist\nMake sure postgres is running. You can check that by running `docker ps`\n\u2705Solution: If you have postgres software installed on your computer before now, build your instance on a different port like 8080 instead of 5432",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5432 failed: FATAL: dodatabase \"ny_taxi\" does not exist"
},
{
"text": "Issue:\ne\u2026\nSolution:\npip install psycopg2-binary\nIf you already have it, you might need to update it:\npip install psycopg2-binary --upgrade\nOther methods, if the above fails:\nif you are getting the \u201c ModuleNotFoundError: No module named 'psycopg2' \u201c error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages i.e\nFirst uninstall the psycopg package\nThen update conda or pip\nThen install psycopg again using pip.\nif you are still facing error with r pcycopg2 and showing pg_config not found then you will have to install postgresql. in MAC it is brew install postgresql",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - ModuleNotFoundError: No module named 'psycopg2'"
},
{
"text": "In the join queries, if we mention the column name directly or enclosed in single quotes it\u2019ll throw an error says \u201ccolumn does not exist\u201d.\n\u2705Solution: But if we enclose the column names in double quotes then it will work",
"section": "Module 1: Docker and Terraform",
"question": "Postgres - \"Column does not exist\" but it actually does (Pyscopg2 error in MacBook Pro M2)"
},
{
"text": "pgAdmin has a new version. Create server dialog may not appear. Try using register-> server instead.",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Create server dialog does not appear"
},
{
"text": "Using GitHub Codespaces in the browser resulted in a blank screen after the login to pgAdmin (running in a Docker container). The terminal of the pgAdmin container was showing the following error message:\nCSRFError: 400 Bad Request: The referrer does not match the host.\nSolution #1:\nAs recommended in the following issue https://github.com/pgadmin-org/pgadmin4/issues/5432 setting the following environment variable solved it.\nPGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\"\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-p \"8080:80\" \\\n--name pgadmin \\\n--network=pg-network \\\ndpage/pgadmin4:8.2\nSolution #2:\nUsing the local installed VSCode to display GitHub Codespaces.\nWhen using GitHub Codespaces in the locally installed VSCode (opening a Codespace or creating/starting one) this issue did not occur.",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Blank/white screen after login (browser)"
},
{
"text": "I am using a Mac Pro device and connect to the GCP Compute Engine via Remote SSH - VSCode. But when I trying to run the PgAdmin container via docker run or docker compose command, I am failed to access the pgAdmin address via my browser. I have switched to another browser, but still can not access the pgAdmin address. So I modified a little bit the configuration from the previous DE Zoomcamp repository like below and can access the pgAdmin address:\nSolution #1:\nModified \u201cdocker run\u201d command\ndocker run --rm -it \\\n-e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n-e PGADMIN_DEFAULT_PASSWORD=\"pgadmin\" \\\n-e PGADMIN_CONFIG_WTF_CSRF_ENABLED=\"False\" \\\n-e PGADMIN_LISTEN_ADDRESS=0.0.0.0 \\\n-e PGADMIN_LISTEN_PORT=5050 \\\n-p 5050:5050 \\\n--network=de-zoomcamp-network \\\n--name pgadmin-container \\\n--link postgres-container \\\n-t dpage/pgadmin4\nSolution #2:\nModified docker-compose.yaml configuration (via \u201cdocker compose up\u201d command)\npgadmin:\nimage: dpage/pgadmin4\ncontainer_name: pgadmin-conntainer\nenvironment:\n- PGADMIN_DEFAULT_EMAIL=admin@admin.com\n- PGADMIN_DEFAULT_PASSWORD=pgadmin\n- PGADMIN_CONFIG_WTF_CSRF_ENABLED=False\n- PGADMIN_LISTEN_ADDRESS=0.0.0.0\n- PGADMIN_LISTEN_PORT=5050\nvolumes:\n- \"./pgadmin_data:/var/lib/pgadmin/data\"\nports:\n- \"5050:5050\"\nnetworks:\n- de-zoomcamp-network\ndepends_on:\n- postgres-conntainer\nPython - ModuleNotFoundError: No module named 'pysqlite2'\nImportError: DLL load failed while importing _sqlite3: The specified module could not be found. ModuleNotFoundError: No module named 'pysqlite2'\nThe issue seems to arise from the missing of sqlite3.dll in path \".\\Anaconda\\Dlls\\\".\n\u2705I solved it by simply copying that .dll file from \\Anaconda3\\Library\\bin and put it under the path mentioned above. (if you are using anaconda)",
"section": "Module 1: Docker and Terraform",
"question": "pgAdmin - Can not access/open the PgAdmin address via browser"
},
{
"text": "If you follow the video 1.2.2 - Ingesting NY Taxi Data to Postgres and you execute all the same\nsteps as Alexey does, you will ingest all the data (~1.3 million rows) into the table yellow_taxi_data as expected.\nHowever, if you try to run the whole script in the Jupyter notebook for a second time from top to bottom, you will be missing the first chunk of 100000 records. This is because there is a call to the iterator before the while loop that puts the data in the table. The while loop therefore starts by ingesting the second chunk, not the first.\n\u2705Solution: remove the cell \u201cdf=next(df_iter)\u201d that appears higher up in the notebook than the while loop. The first time w(df_iter) is called should be within the while loop.\n\ud83d\udcd4Note: As this notebook is just used as a way to test the code, it was not intended to be run top to bottom, and the logic is tidied up in a later step when it is instead inserted into a .py file for the pipeline",
"section": "Module 1: Docker and Terraform",
"question": "Python - Ingestion with Jupyter notebook - missing 100000 records"
},
{
"text": "{t_end - t_start} seconds\")\nimport pandas as pd\ndf = pd.read_csv('path/to/file.csv.gz', /app/ingest_data.py:1: DeprecationWarning:)\nIf you prefer to keep the uncompressed csv (easier preview in vscode and similar), gzip files can be unzipped using gunzip (but not unzip). On a Ubuntu local or virtual machine, you may need to apt-get install gunzip first.",
"section": "Module 1: Docker and Terraform",
"question": "Python - Iteration csv without error"
},
{
"text": "Pandas can interpret \u201cstring\u201d column values as \u201cdatetime\u201d directly when reading the CSV file using \u201cpd.read_csv\u201d using the parameter \u201cparse_dates\u201d, which for example can contain a list of column names or column indices. Then the conversion afterwards is not required anymore.\npandas.read_csv \u2014 pandas 2.1.4 documentation (pydata.org)\nExample from week 1\nimport pandas as pd\ndf = pd.read_csv(\n'yellow_tripdata_2021-01.csv',\nnrows=100,\nparse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\ndf.info()\nwhich will output\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 100 entries, 0 to 99\nData columns (total 18 columns):\n# Column Non-Null Count Dtype\n--- ------ -------------- -----\n0 VendorID 100 non-null int64\n1 tpep_pickup_datetime 100 non-null datetime64[ns]\n2 tpep_dropoff_datetime 100 non-null datetime64[ns]\n3 passenger_count 100 non-null int64\n4 trip_distance 100 non-null float64\n5 RatecodeID 100 non-null int64\n6 store_and_fwd_flag 100 non-null object\n7 PULocationID 100 non-null int64\n8 DOLocationID 100 non-null int64\n9 payment_type 100 non-null int64\n10 fare_amount 100 non-null float64\n11 extra 100 non-null float64\n12 mta_tax 100 non-null float64\n13 tip_amount 100 non-null float64\n14 tolls_amount 100 non-null float64\n15 improvement_surcharge 100 non-null float64\n16 total_amount 100 non-null float64\n17 congestion_surcharge 100 non-null float64\ndtypes: datetime64[ns](2), float64(9), int64(6), object(1)\nmemory usage: 14.2+ KB",
"section": "Module 1: Docker and Terraform",
"question": "iPython - Pandas parsing dates with \u2018read_csv\u2019"
},
{
"text": "os.system(f\"curl -LO {url} -o {csv_name}\")",
"section": "Module 1: Docker and Terraform",
"question": "Python - Python cant ingest data from the github link provided using curl"
},
{
"text": "When a CSV file is compressed using Gzip, it is saved with a \".csv.gz\" file extension. This file type is also known as a Gzip compressed CSV file. When you want to read a Gzip compressed CSV file using Pandas, you can use the read_csv() function, which is specifically designed to read CSV files. The read_csv() function accepts several parameters, including a file path or a file-like object. To read a Gzip compressed CSV file, you can pass the file path of the \".csv.gz\" file as an argument to the read_csv() function.\nHere is an example of how to read a Gzip compressed CSV file using Pandas:\ndf = pd.read_csv('file.csv.gz'\n, compression='gzip'\n, low_memory=False\n)",
"section": "Module 1: Docker and Terraform",
"question": "Python - Pandas can read *.csv.gzip"
},
{
"text": "Contrary to panda\u2019s read_csv method there\u2019s no such easy way to iterate through and set chunksize for parquet files. We can use PyArrow (Apache Arrow Python bindings) to resolve that.\nimport pyarrow.parquet as pq\noutput_name = \u201chttps://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet\u201d\nparquet_file = pq.ParquetFile(output_name)\nparquet_size = parquet_file.metadata.num_rows\nengine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')\ntable_name=\u201dyellow_taxi_schema\u201d\n# Clear table if exists\npq.read_table(output_name).to_pandas().head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')\n# default (and max) batch size\nindex = 65536\nfor i in parquet_file.iter_batches(use_threads=True):\nt_start = time()\nprint(f'Ingesting {index} out of {parquet_size} rows ({index / parquet_size:.0%})')\ni.to_pandas().to_sql(name=table_name, con=engine, if_exists='append')\nindex += 65536\nt_end = time()\nprint(f'\\t- it took %.1f seconds' % (t_end - t_start))",
"section": "Module 1: Docker and Terraform",
"question": "Python - How to iterate through and ingest parquet file"
},
{
"text": "Error raised during the jupyter notebook\u2019s cell execution:\nfrom sqlalchemy import create_engine.\nSolution: Version of Python module \u201ctyping_extensions\u201d >= 4.6.0. Can be updated by Conda or pip.",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLAlchemy - ImportError: cannot import name 'TypeAliasType' from 'typing_extensions'."
},
{
"text": "create_engine('postgresql://root:root@localhost:5432/ny_taxi') I get the error \"TypeError: 'module' object is not callable\"\nSolution:\nconn_string = \"postgresql+psycopg://root:root@localhost:5432/ny_taxi\"\nengine = create_engine(conn_string)",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLALchemy - TypeError 'module' object is not callable"
},
{
"text": "Error raised during the jupyter notebook\u2019s cell execution:\nengine = create_engine('postgresql://root:root@localhost:5432/ny_taxi').\nSolution: Need to install Python module \u201cpsycopg2\u201d. Can be installed by Conda or pip.",
"section": "Module 1: Docker and Terraform",
"question": "Python - SQLAlchemy - ModuleNotFoundError: No module named 'psycopg2'."
},
{
"text": "Unable to add Google Cloud SDK PATH to Windows\nWindows error: The installer is unable to automatically update your system PATH. Please add C:\\tools\\google-cloud-sdk\\bin\nif you are constantly getting this feedback. Might be that you needed to add Gitbash to your Windows path:\nOne way of doing that is to use conda: \u2018If you are not already using it\nDownload the Anaconda Navigator\nMake sure to check the box (add conda to the path when installing navigator: although not recommended do it anyway)\nYou might also need to install git bash if you are not already using it(or you might need to uninstall it to reinstall it properly)\nMake sure to check the following boxes while you install Gitbash\nAdd a GitBash to Windows Terminal\nUse Git and optional Unix tools from the command prompt\nNow open up git bash and type conda init bash This should modify your bash profile\nAdditionally, you might want to use Gitbash as your default terminal.\nOpen your Windows terminal and go to settings, on the default profile change Windows power shell to git bash",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Unable to add Google Cloud SDK PATH to Windows"
},
{
"text": "It asked me to create a project. This should be done from the cloud console. So maybe we don\u2019t need this FAQ.\nWARNING: Project creation failed: HttpError accessing <https://cloudresourcemanager.googleapis.com/v1/projects?alt=json>: response: <{'vtpep_pickup_datetimeary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Mon, 24 Jan 2022 19:29:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'server-timing': 'gfet4t7; dur=189', 'alt-svc': 'h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"', 'transfer-encoding': 'chunked', 'status': 409}>, content <{\n\"error\": {\n\"code\": 409,\n\"message\": \"Requested entity alreadytpep_pickup_datetime exists\",\n\"status\": \"ALREADY_EXISTS\"\n}\n}\nFrom Stackoverflow: https://stackoverflow.com/questions/52561383/gcloud-cli-cannot-create-project-the-project-id-you-specified-is-already-in-us?rq=1\nProject IDs are unique across all projects. That means if any user ever had a project with that ID, you cannot use it. testproject is pretty common, so it's not surprising it's already taken.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Project creation failed: HttpError accessing \u2026 Requested entity alreadytpep_pickup_datetime exists"
},
{
"text": "If you receive the error: \u201cError 403: The project to be billed is associated with an absent billing account., accountDisabled\u201d It is most likely because you did not enter YOUR project ID. The snip below is from video 1.3.2\nThe value you enter here will be unique to each student. You can find this value on your GCP Dashboard when you login.\nAshish Agrawal\nAnother possibility is that you have not linked your billing account to your current project",
"section": "Module 1: Docker and Terraform",
"question": "GCP - The project to be billed is associated with an absent billing account"
},
{
"text": "GCP Account Suspension Inquiry\nIf Google refuses your credit/debit card, try another - I\u2019ve got an issue with Kaspi (Kazakhstan) but it worked with TBC (Georgia).\nUnfortunately, there\u2019s small hope that support will help.\nIt seems that Pyypl web-card should work too.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - OR-CBAT-15 ERROR Google cloud free trial account"
},
{
"text": "The ny-rides.json is your private file in Google Cloud Platform (GCP). \n\nAnd here\u2019s the way to find it:\nGCP -> Select project with your instance -> IAM & Admin -> Service Accounts Keys tab -> add key, JSON as key type, then click create\nNote: Once you go into Service Accounts Keys tab, click the email, then you can see the \u201cKEYS\u201d tab where you can add key as a JSON as its key type",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Where can I find the \u201cny-rides.json\u201d file?"
},
{
"text": "In this lecture, Alexey deleted his instance in Google Cloud. Do I have to do it?\nNope. Do not delete your instance in Google Cloud platform. Otherwise, you have to do this twice for the week 1 readings.",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Do I need to delete my instance in Google Cloud?"
},
{
"text": "System Resource Usage:\ntop or htop: Shows real-time information about system resource usage, including CPU, memory, and processes.\nfree -h: Displays information about system memory usage and availability.\ndf -h: Shows disk space usage of file systems.\ndu -h <directory>: Displays disk usage of a specific directory.\nRunning Processes:\nps aux: Lists all running processes along with detailed information.\nNetwork:\nifconfig or ip addr show: Shows network interface configuration.\nnetstat -tuln: Displays active network connections and listening ports.\nHardware Information:\nlscpu: Displays CPU information.\nlsblk: Lists block devices (disks and partitions).\nlshw: Lists hardware configuration.\nUser and Permissions:\nwho: Shows who is logged on and their activities.\nw: Displays information about currently logged-in users and their processes.\nPackage Management:\napt list --installed: Lists installed packages (for Ubuntu and Debian-based systems)",
"section": "Module 1: Docker and Terraform",
"question": "Commands to inspect the health of your VM:"
},
{
"text": "if you\u2019ve got the error\n\u2502 Error: Error updating Dataset \"projects/<your-project-id>/datasets/demo_dataset\": googleapi: Error 403: Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. The default table expiration time must be less than 60 days, billingNotEnabled\nbut you\u2019ve set your billing account indeed, then try to disable billing for the project and enable it again. It worked for ME!",
"section": "Module 1: Docker and Terraform",
"question": "Billing account has not been enabled for this project. But you\u2019ve done it indeed!"
},
{
"text": "for windows if you having trouble install SDK try follow these steps on the link, if you getting this error:\nThese credentials will be used by any library that requests Application Default Credentials (ADC).\nWARNING:\nCannot find a quota project to add to ADC. You might receive a \"quota exceeded\" or \"API not enabled\" error. Run $ gcloud auth application-default set-quota-project to add a quota project.\nFor me:\nI reinstalled the sdk using unzip file \u201cinstall.bat\u201d,\nafter successfully checking gcloud version,\nrun gcloud init to set up project before\nyou run gcloud auth application-default login\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/1_terraform_gcp/windows.md\nGCP VM - I cannot get my Virtual Machine to start because GCP has no resources.\nClick on your VM\nCreate an image of your VM\nOn the page of the image, tell GCP to create a new VM instance via the image\nOn the settings page, change the location",
"section": "Module 1: Docker and Terraform",
"question": "GCP - Windows Google Cloud SDK install issue:gcp"
},
{
"text": "The reason this video about the GCP VM exists is that many students had problems configuring their env. You can use your own env if it works for you.\nAnd the advantage of using your own environment is that if you are working in a Github repo where you can commit, you will be able to commit the changes that you do. In the VM the repo is cloned via HTTPS so it is not possible to directly commit, even if you are the owner of the repo.",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Is it necessary to use a GCP VM? When is it useful?"
},
{
"text": "I am trying to create a directory but it won't let me do it\nUser1@DESKTOP-PD6UM8A MINGW64 /\n$ mkdir .ssh\nmkdir: cannot create directory \u2018.ssh\u2019: Permission denied\nYou should do it in your home directory. Should be your home (~)\nLocal. But it seems you're trying to do it in the root folder (/). Should be your home (~)\nLink to Video 1.4.1",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - mkdir: cannot create directory \u2018.ssh\u2019: Permission denied"
},
{
"text": "Failed to save '<file>': Unable to write file 'vscode-remote://ssh-remote+de-zoomcamp/home/<user>/data_engineering_course/week_2/airflow/dags/<file>' (NoPermissions (FileSystemError): Error: EACCES: permission denied, open '/home/<user>/data_engineering_course/week_2/airflow/dags/<file>')\nYou need to change the owner of the files you are trying to edit via VS Code. You can run the following command to change the ownership.\nssh\nsudo chown -R <user> <path to your directory>",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Error while saving the file in VM via VS Code"
},
{
"text": "Question: I connected to my VM perfectly fine last week (ssh) but when I tried again this week, the connection request keeps timing out.\n\u2705Answer: Start your VM. Once the VM is running, copy its External IP and paste that into your config file within the ~/.ssh folder.\ncd ~/.ssh\ncode config \u2190 this opens the config file in VSCode",
"section": "Module 1: Docker and Terraform",
"question": ". GCP VM - VM connection request timeout"
},
{
"text": "(reference: https://serverfault.com/questions/953290/google-compute-engine-ssh-connect-to-host-ip-port-22-operation-timed-out)Go to edit your VM.\nGo to section Automation\nAdd Startup script\n```\n#!/bin/bash\nsudo ufw allow ssh\n```\nStop and Start VM.",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - connect to host port 22 no route to host"
},
{
"text": "You can easily forward the ports of pgAdmin, postgres and Jupyter Notebook using the built-in tools in Ubuntu and without any additional client:\nFirst, in the VM machine, launch docker-compose up -d and jupyter notebook in the correct folder.\nFrom the local machine, execute: ssh -i ~/.ssh/gcp -L 5432:localhost:5432 username@external_ip_of_vm\nExecute the same command but with ports 8080 and 8888.\nNow you can access pgAdmin on local machine in browser typing localhost:8080\nFor Jupyter Notebook, type localhost:8888 in the browser of your local machine. If you have problems with the credentials, it is possible that you have to copy the link with the access token provided in the logs of the terminal of the VM machine when you launched the jupyter notebook command.\nTo forward both pgAdmin and postgres use, ssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 modito@35.197.218.128",
"section": "Module 1: Docker and Terraform",
"question": "GCP VM - Port forwarding from GCP without using VS Code"
},
{
"text": "If you are using MS VS Code and running gcloud in WSL2, when you first try to login to gcp via the gcloud cli gcloud auth application-default login, you will see a message like this, and nothing will happen\nAnd there might be a prompt to ask if you want to open it via browser, if you click on it, it will open up a page with error message\nSolution : you should instead hover on the long link, and ctrl + click the long link\n\nClick configure Trusted Domains here\n\nPopup will appear, pick first or second entry\nNext time you gcloud auth, the login page should popup via default browser without issues",
"section": "Module 1: Docker and Terraform",
"question": "GCP gcloud + MS VS Code - gcloud auth hangs"
},
{
"text": "It is an internet connectivity error, terraform is somehow not able to access the online registry. Check your VPN/Firewall settings (or just clear cookies or restart your network). Try terraform init again after this, it should work.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error: Failed to query available provider packages \u2502 Could not retrieve the list of available versions for provider hashicorp/google: could not query \u2502 provider registry for registry.terrafogorm.io/hashicorp/google: the request failed after 2 attempts, \u2502 please try again later"
},
{
"text": "The issue was with the network. Google is not accessible in my country, I am using a VPN. And The terminal program does not automatically follow the system proxy and requires separate proxy configuration settings.I opened a Enhanced Mode in Clash, which is a VPN app, and 'terraform apply' works! So if you encounter the same issue, you can ask help for your vpn provider.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error:Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=coherent-ascent-379901\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 172.217.163.42:443: i/o timeout"
},
{
"text": "https://techcommunity.microsoft.com/t5/azure-developer-community-blog/configuring-terraform-on-windows-10-linux-sub-system/ba-p/393845",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Install for WSL"
},
{
"text": "https://github.com/hashicorp/terraform/issues/14513",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error acquiring the state lock"
},
{
"text": "When running\nterraform apply\non wsl2 I've got this error:\n\u2502 Error: Post \"https://storage.googleapis.com/storage/v1/b?alt=json&prettyPrint=false&project=<your-project-id>\": oauth2: cannot fetch token: 400 Bad Request\n\u2502 Response: {\"error\":\"invalid_grant\",\"error_description\":\"Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.\"}\nIT happens because there may be time desync on your machine which affects computing JWT\nTo fix this, run the command\nsudo hwclock -s\nwhich fixes your system time.\nReference",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error 400 Bad Request. Invalid JWT Token on WSL."
},
{
"text": "\u2502 Error: googleapi: Error 403: Access denied., forbidden\nYour $GOOGLE_APPLICATION_CREDENTIALS might not be pointing to the correct file \nrun = export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/YOUR_JSON.json\nAnd then = gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error 403 : Access denied"
},
{
"text": "One service account is enough for all the services/resources you'll use in this course. After you get the file with your credentials and set your environment variable, you should be good to go.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Do I need to make another service account for terraform before I get the keys (.json file)?"
},
{
"text": "Here: https://releases.hashicorp.com/terraform/1.1.3/terraform_1.1.3_linux_amd64.zip",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Where can I find the Terraform 1.1.3 Linux (AMD 64)?"
},
{
"text": "You get this error because I run the command terraform init outside the working directory, and this is wrong.You need first to navigate to the working directory that contains terraform configuration files, and and then run the command.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Terraform initialized in an empty directory! The directory has no Terraform configuration files. You may begin working with Terraform immediately by creating Terraform configuration files.g"
},
{
"text": "The error:\nError: googleapi: Error 403: Access denied., forbidden\n\u2502\nand\n\u2502 Error: Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes.\nFor this solution make sure to run:\necho $GOOGLE_APPLICATION_CREDENTIALS\necho $?\nSolution:\nYou have to set again the GOOGLE_APPLICATION_CREDENTIALS as Alexey did in the environment set-up video in week1:\nexport GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error creating Dataset: googleapi: Error 403: Request had insufficient authentication scopes"
},
{
"text": "The error:\nError: googleapi: Error 403: terraform-trans-campus@trans-campus-410115.iam.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project. Permission 'storage.buckets.create' denied on resource (or it may not exist)., forbidden\nThe solution:\nYou have to declare the project name as your Project ID, and not your Project name, available on GCP console Dashboard.",
"section": "Module 1: Docker and Terraform",
"question": "Terraform - Error creating Bucket: googleapi: Error 403: Permission denied to access \u2018storage.buckets.create\u2019"
},
{
"text": "provider \"google\" {\nproject = var.projectId\ncredentials = file(\"${var.gcpkey}\")\n#region = var.region\nzone = var.zone\n}",
"section": "Module 1: Docker and Terraform",
"question": "To ensure the sensitivity of the credentials file, I had to spend lot of time to input that as a file."
},
{
"text": "For the HW1 I encountered this issue. The solution is\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria Zone';\nI think columns which start with uppercase need to go between \u201cColumn\u201d. I ran into a lot of issues like this and \u201c \u201d made it work out.\nAddition to the above point, for me, there is no \u2018Astoria Zone\u2019, only \u2018Astoria\u2019 is existing in the dataset.\nSELECT * FROM zones AS z WHERE z.\"Zone\" = 'Astoria\u2019;",
"section": "Module 1: Docker and Terraform",
"question": "SQL - SELECT * FROM zones_taxi WHERE Zone='Astoria Zone'; Error Column Zone doesn't exist"
},
{
"text": "It is inconvenient to use quotation marks all the time, so it is better to put the data to the database all in lowercase, so in Pandas after\ndf = pd.read_csv(\u2018taxi+_zone_lookup.csv\u2019)\nAdd the row:\ndf.columns = df.columns.str.lower()",
"section": "Module 1: Docker and Terraform",
"question": "SQL - SELECT Zone FROM taxi_zones Error Column Zone doesn't exist"
},
{
"text": "Solution (for mac users): os.system(f\"curl {url} --output {csv_name}\")",
"section": "Module 1: Docker and Terraform",
"question": "CURL - curl: (6) Could not resolve host: output.csv"
},
{
"text": "To resolve this, ensure that your config file is in C/User/Username/.ssh/config",
"section": "Module 1: Docker and Terraform",
"question": "SSH Error: ssh: Could not resolve hostname linux: Name or service not known"
},
{
"text": "If you use Anaconda (recommended for the course), it comes with pip, so the issues is probably that the anaconda\u2019s Python is not on the PATH.\nAdding it to the PATH is different for each operation system.\nFor Linux and MacOS:\nOpen a terminal.\nFind the path to your Anaconda installation. This is typically `~/anaconda3` or `~/opt/anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/path/to/anaconda3/bin:$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` (Linux) or `.bash_profile` (MacOS) file.\nOn Windows, python and pip are in different locations (python is in the anaconda root, and pip is in Scripts). With GitBash:\nLocate your Anaconda installation. The default path is usually `C:\\Users\\[YourUsername]\\Anaconda3`.\nDetermine the correct path format for Git Bash. Paths in Git Bash follow the Unix-style, so convert the Windows path to a Unix-style path. For example, `C:\\Users\\[YourUsername]\\Anaconda3` becomes `/c/Users/[YourUsername]/Anaconda3`.\nAdd Anaconda to your PATH with the command: `export PATH=\"/c/Users/[YourUsername]/Anaconda3/:/c/Users/[YourUsername]/Anaconda3/Scripts/$PATH\"`.\nTo make this change permanent, add the command to your `.bashrc` file in your home directory.\nRefresh your environment with the command: `source ~/.bashrc`.\nFor Windows (without Git Bash):\nRight-click on 'This PC' or 'My Computer' and select 'Properties'.\nClick on 'Advanced system settings'.\nIn the System Properties window, click on 'Environment Variables'.\nIn the Environment Variables window, select the 'Path' variable in the 'System variables' section and click 'Edit'.\nIn the Edit Environment Variable window, click 'New' and add the path to your Anaconda installation (typically `C:\\Users\\[YourUsername]\\Anaconda3` and C:\\Users\\[YourUsername]\\Anaconda3\\Scripts`).\nClick 'OK' in all windows to apply the changes.\nAfter adding Anaconda to the PATH, you should be able to use `pip` from the command line. 
Remember to restart your terminal (or command prompt in Windows) to apply these changes.",
"section": "Module 1: Docker and Terraform",
"question": "'pip' is not recognized as an internal or external command, operable program or batch file."
},
{
"text": "Resolution: You need to stop the services which is using the port.\nRun the following:\n```\nsudo kill -9 `sudo lsof -t -i:<port>`\n```\n<port> being 8080 in this case. This will free up the port for use.\n~ Abhijit Chakraborty\nError: error response from daemon: cannot stop container: 1afaf8f7d52277318b71eef8f7a7f238c777045e769dd832426219d6c4b8dfb4: permission denied\nResolution: In my case, I had to stop docker and restart the service to get it running properly\nUse the following command:\n```\nsudo systemctl restart docker.socket docker.service\n```\n~ Abhijit Chakraborty\nError: cannot import module psycopg2\nResolution: Run the following command in linux:\n```\nsudo apt-get install libpq-dev\npip install psycopg2\n```\n~ Abhijit Chakraborty\nError: docker build Error checking context: 'can't stat '<path-to-file>'\nResolution: This happens due to insufficient permission for docker to access a certain file within the directory which hosts the Dockerfile.\n1. You can create a .dockerignore file and add the directory/file which you want Dockerfile to ignore while build.\n2. If the above does not work, then put the dockerfile and corresponding script, `\t1.py` in our case to a subfolder. and run `docker build ...`\nfrom inside the new folder.\n~ Abhijit Chakraborty",
"section": "Module 1: Docker and Terraform",
"question": "Error: error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use"
},
{
"text": "To get a pip-friendly requirements.txt file file from Anaconda use\nconda install pip then `pip list \u2013format=freeze > requirements.txt`.\n`conda list -d > requirements.txt` will not work and `pip freeze > requirements.txt` may give odd pathing.",
"section": "Module 2: Workflow Orchestration",
"question": "Anaconda to PIP"
},
{
"text": "Prefect: https://docs.google.com/document/d/1K_LJ9RhAORQk3z4Qf_tfGQCDbu8wUWzru62IUscgiGU/edit?usp=sharing\nAirflow: https://docs.google.com/document/d/1-BwPAsyDH_mAsn8HH5z_eNYVyBMAtawJRjHHsjEKHyY/edit?usp=sharing",
"section": "Module 2: Workflow Orchestration",
"question": "Where are the FAQ questions from the previous cohorts for the orchestration module?"
},
{
"text": "Issue : Docker containers exit instantly with code 132, upon docker compose up\nMage documentation has it listing the cause as \"older architecture\" .\nThis might be a hardware issue, so unless you have another computer, you can't solve it without purchasing a new one, so the next best solution is a VM.\nThis is from a student running on a VirtualBox VM, Ubuntu 22.04.3 LTS, Docker version 25.0.2. So not having the context on how the vbox was spin up with (CPU, RAM, network, etc), it\u2019s really inconclusive at this time.",
"section": "Module 2: Workflow Orchestration",
"question": "Docker - 2.2.2 Configure Mage"
},
{
"text": "This issue was occurring with Windows WSL 2\nFor me this was because WSL 2 was not dedicating enough cpu cores to Docker.The load seems to take up at least one cpu core so I recommend dedicating at least two.\nOpen Bash and run the following code:\n$ cd ~\n$ ls -la\nLook for the .wsl config file:\n-rw-r--r-- 1 ~1049089 31 Jan 25 12:54 .wslconfig\nUsing a text editing tool of your choice edit or create your .wslconfig file:\n$ nano .wslconfig\nPaste the following into the new file/ edit the existing file in this format and save:\n*** Note - for memory\u2013 this is the RAM on your machine you can dedicate to Docker, your situation may be different than mine ***\n[wsl2]\nprocessors=<Number of Processors - at least 2!> example: 4\nmemory=<memory> example:4GB\nExample:\nOnce you do that run:\n$ wsl --shutdown\nThis shuts down WSL\nThen Restart Docker Desktop - You should now be able to load the .csv.gz file without the error into a pandas dataframe",
"section": "Module 2: Workflow Orchestration",
"question": "WSL - 2.2.3 Mage - Unexpected Kernel Restarts; Kernel Running out of memory:"
},
{
"text": "The issue and solution on the link:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1706817366764269?thread_ts=1706815324.993529&cid=C01FABYF2RG",
"section": "Module 2: Workflow Orchestration",
"question": "2.2.3 Configuring Postgres"
},
{
"text": "Check that the POSTGRES_PORT variable in the io_config.yml file is set to port 5432, which is the default postgres port. The POSTGRES_PORT variable is the mage container port, not the host port. Hence, there\u2019s no need to set the POSTGRES_PORT to 5431 just because you already have a conflicting postgres installation in your host machine.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE - 2.2.3 OperationalError: (psycopg2.OperationalError) connection to server at \"localhost\" (::1), port 5431 failed: Connection refused"
},
{
"text": "You forgot to select \u2018dev\u2019 profile in the dropdown menu next to where you select \u2018PostgreSQL\u2019 in the connection drop down.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE - 2.2.4 executing SELECT 1; results in KeyError"
},
{
"text": "If you are getting this error. Update your mage io_config.yaml file, and specify a timeout value set to 600 like this.\nMake sure to save your changes.\nMAGE - 2.2.4 Testing BigQuery connection using SQL 404 error:\nNotFound: 404 Not found: Dataset ny-rides-diegogutierrez:None was not found in location northamerica-northeast1\nIf you get this error even with all roles/permissions given to the service account check if you have ticked the box where it says \u201cUse raw SQL\u201d, just like the image below.",
"section": "Module 2: Workflow Orchestration",
"question": "MAGE -2.2.4 ConnectionError: ('Connection aborted.', TimeoutError('The write operation timed out'))"
},
{
"text": "Solution: https://stackoverflow.com/questions/48056381/google-client-invalid-jwt-token-must-be-a-short-lived-token",
"section": "Module 2: Workflow Orchestration",
"question": "Problem: RefreshError: ('invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.'})"
},
{
"text": "Origin of Solution (Mage Slack-Channel): https://mageai.slack.com/archives/C03HTTWFEKE/p1706543947795599\nProblem: This error can often be seen after solving the error mentioned in 2.2.4. The error can be found in Mage version 0.9.61 and is a side-effect of the update of the code for data-loader blocks.\nNote: Mage 0.9.62 has been released, as of Feb 5 2024. Please recheck. Solution below may be obsolete\nSolution: Using a \u201cfixed\u201d version of the docker container\nPull updated docker image from docker-hub\nmageai/mageaidocker pull:alpha\nUpdate docker-compose.yaml\nversion: '3'\nservices:\nmagic:\nimage: mageai/mageai:alpha <--- instead of \u201clatest\u201d-tag\ndocker-compose up\nThe original Error is still present, but the SQL-query will return the desired result:\n--------------------------------------------------------------------------------------",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - 2.2.4 IndexError: list index out of range"
},
{
"text": "Add\nif not path.parent.is_dir():\npath.parent.mkdir(parents=True)\npath = Path(path).as_posix()\nsee:\nhttps://datatalks-club.slack.com/archives/C01FABYF2RG/p1675774214591809?thread_ts=1675768839.028879&cid=C01FABYF2RG",
"section": "Module 2: Workflow Orchestration",
"question": "2.2.6 OSError: Cannot save file into a non-existent directory: '..\\\\..\\\\data\\\\yellow'\\n\")"
},
{
"text": "The video DE Zoomcamp 2.2.7 is missing the actual deployment of Mage using Terraform to GCP. The steps for the deployment were not covered in the video.\nI successfully deployed it and wanted to share some key points:\nIn variables.tf, set the project_id default value to your GCP project ID.\nEnable the Cloud Filestore API:\nVisit the Google Cloud Console.to\nNavigate to \"APIs & Services\" > \"Library.\"\nSearch for \"Cloud Filestore API.\"\nClick on the API and enable it.\nTo perform the deployment:\nterraform init\nterraform apply\nPlease note that during the terraform apply step, Terraform will prompt you to enter the PostgreSQL password. After that, it will ask for confirmation to proceed with the deployment. Review the changes, type 'yes' when prompted, and press Enter.",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Deploying Mage to GCP"
},
{
"text": "If you want to rune multiple docker containers from different directories. Then make sure to change the port mappings in the docker-compose.yml file.\nports:\n- 8088:6789\nThe 8088 port in above case is hostport, where mage will run on your local machine. You can customize this as long as the port is available. If you are running on VM, make sure to forward the port too. You need to keep the container port to 6789 as this is the port where mage is running.\nGCP - 2.2.7d Deploying Mage to Google Cloud\nWhile terraforming all the resources inside a VM created in GCS the following error is shown.\nError log:\nmodule.lb-http.google_compute_backend_service.default[\"default\"]: Creating...\n\u2577\n\u2502 Error: Error creating GlobalAddress: googleapi: Error 403: Request had insufficient authentication scopes.\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"googleapis.com\",\n\u2502 \"metadatas\": {\n\u2502 \"method\": \"compute.beta.GlobalAddressesService.Insert\",\n\u2502 \"service\": \"compute.googleapis.com\"\n\u2502 },\n\u2502 \"reason\": \"ACCESS_TOKEN_SCOPE_INSUFFICIENT\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 More details:\n\u2502 Reason: insufficientPermissions, Message: Insufficient Permission\nThis error might happen when you are using a VM inside GCS. To use the Google APIs from a GCP virtual machine you need to add the cloud platform scope (\"https://www.googleapis.com/auth/cloud-platform\") to your VM when it is created.\nSince ours is already created you can just stop it and change the permissions. You can do it in the console, just go to \"EDIT\", g99o all the way down until you find \"Cloud API access scopes\". There you can \"Allow full access to all Cloud APIs\". I did this and all went smoothly generating all the resources needed. 
Hope it helps if you encounter this same error.\nResources: https://stackoverflow.com/questions/35928534/403-request-had-insufficient-authentication-scopes-during-gcloud-container-clu",
"section": "Module 2: Workflow Orchestration",
"question": "Ruuning Multiple Mage instances in Docker from different directories"
},
{
"text": "If you are on the free trial account on GCP you will face this issue when trying to deploy the infrastructures with terraform. This service is not available for this kind of account.\nThe solution I found was to delete the load_balancer.tf file and to comment or delete the rows that differentiate it on the main.tf file. After this just do terraform destroy to delete any infrastructure created on the fail attempts and re-run the terraform apply.\nCode on main.tf to comment/delete:\nLine 166, 167, 168",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Load Balancer Problem (Security Policies quota)"
},
{
"text": "If you get the following error\nYou have to edit variables.tf on the gcp folder, set your project-id and region and zones properly. Then, run terraform apply again.\nYou can find correct regions/zones here: https://cloud.google.com/compute/docs/regions-zones\nDeploying MAGE to GCP with Terraform via the VM (2.2.7)\nFYI - It can take up to 20 minutes to deploy the MAGE Terraform files if you are using a GCP Virtual Machine. It is normal, so don\u2019t interrupt the process or think it\u2019s taking too long. If you have, make sure you run a terraform destroy before trying again as you will have likely partially created resources which will cause errors next time you run `terraform apply`.\n`terraform destroy` may not completely delete partial resources - go to Google Cloud Console and use the search bar at the top to search for the \u2018app.name\u2019 you declared in your variables.tf file; this will list all resources with that name - make sure you delete them all before running `terraform apply` again.\nWhy are my GCP free credits going so fast? MAGE .tf files - Terraform Destroy not destroying all Resources\nI checked my GCP billing last night & the MAGE Terraform IaC didn't destroy a GCP Resource called Filestore as \u2018mage-data-prep- it has been costing \u00a35.01 of my free credits each day I now have \u00a3151 left - Alexey has assured me that This amount WILL BE SUFFICIENT funds to finish the course. Note to anyone who had issues deploying the MAGE terraform code: check your billing account to see what you're being charged for (main menu - billing) (even if it's your free credits) and run a search for 'mage-data-prep' in the top bar just to be sure that your resources have been destroyed - if any come up delete them.",
"section": "Module 2: Workflow Orchestration",
"question": "GCP - 2.2.7d Part 2 - Getting error when you run terraform apply"
},
{
"text": "```\n\u2502 Error: Error creating Connector: googleapi: Error 403: Permission 'vpcaccess.connectors.create' denied on resource '//vpcaccess.googleapis.com/projects/<ommit>/locations/us-west1' (or it may not exist).\n\u2502 Details:\n\u2502 [\n\u2502 {\n\u2502 \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n\u2502 \"domain\": \"vpcaccess.googleapis.com\",\n\u2502 \"metadata\": {\n\u2502 \"permission\": \"vpcaccess.connectors.create\",\n\u2502 \"resource\": \"projects/<ommit>/locations/us-west1\"\n\u2502 },\n\u2502 \"reason\": \"IAM_PERMISSION_DENIED\"\n\u2502 }\n\u2502 ]\n\u2502\n\u2502 with google_vpc_access_connector.connector,\n\u2502 on fs.tf line 19, in resource \"google_vpc_access_connector\" \"connector\":\n\u2502 19: resource \"google_vpc_access_connector\" \"connector\" {\n\u2502\n```\nSolution: Add Serverless VPC Access Admin to Service Account.\nLine 148",
"section": "Module 2: Workflow Orchestration",
"question": "Question: Permission 'vpcaccess.connectors.create'"
},
{
"text": "Git won\u2019t push an empty folder to GitHub, so if you put a file in that folder and then push, then you should be good to go.\nOr - in your code- make the folder if it doesn\u2019t exist using Pathlib as shown here: https://stackoverflow.com/a/273227/4590385.\nFor some reason, when using github storage, the relative path for writing locally no longer works. Try using two separate paths, one full path for the local write, and the original relative path for GCS bucket upload.",
"section": "Module 2: Workflow Orchestration",
"question": "File Path: Cannot save file into a non-existent directory: 'data/green'"
},
{
"text": "The green dataset contains lpep_pickup_datetime while the yellow contains tpep_pickup_datetime. Modify the script(s) depending on the dataset as required.",
"section": "Module 2: Workflow Orchestration",
"question": "No column name lpep_pickup_datetime / tpep_pickup_datetime"
},
{
"text": "pd.read_csv\ndf_iter = pd.read_csv(dataset_url, iterator=True, chunksize=100000)\nThe data needs to be appended to the parquet file using the fastparquet engine\ndf.to_parquet(path, compression=\"gzip\", engine='fastparquet', append=True)",
"section": "Module 2: Workflow Orchestration",
"question": "Process to download the VSC using Pandas is killed right away"
},
{
"text": "denied: requested access to the resource is denied\nThis can happen when you\nHaven't logged in properly to Docker Desktop (use docker login -u \"myusername\")\nHave used the wrong username when pushing to docker images. Use the same one as your username and as the one you build on\ndocker image build -t <myusername>/<imagename>:<tag>\ndocker image push <myusername>/<imagename>:<tag>",
"section": "Module 2: Workflow Orchestration",
"question": "Push to docker image failure"
},
{
"text": "16:21:35.607 | INFO | Flow run 'singing-malkoha' - Executing 'write_bq-b366772c-0' immediately...\nKilled\nSolution: You probably are running out of memory on your VM and need to add more. For example, if you have 8 gigs of RAM on your VM, you may want to expand that to 16 gigs.",
"section": "Module 2: Workflow Orchestration",
"question": "Flow script fails with \u201ckilled\u201d message:"
},
{
"text": "After playing around with prefect for a while this can happen.\nSsh to your VM and run sudo du -h --block-size=G | sort -n -r | head -n 30 to see which directory needs the most space.\nMost likely it will be \u2026/.prefect/storage, where your cached flows are stored. You can delete older flows from there. You also have to delete the corresponding flow in the UI, otherwise it will throw you an error, when you try to run your next flow.\nSSL Certificate Verify: (I got it when trying to run flows on MAC): urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]\npip install certifi\n/Applications/Python\\ {ver}/Install\\ Certificates.command\nor\nrunning the \u201cInstall Certificate.command\u201d inside of the python{ver} folder",
"section": "Module 2: Workflow Orchestration",
"question": "GCP VM: Disk Space is full"
},
{
"text": "It means your container consumed all available RAM allocated to it. It can happen in particular when working on Question#3 in the homework as the dataset is relatively large and containers eat a lot of memory in general.\nI would recommend restarting your computer and only starting the necessary processes to run the container. If that doesn\u2019t work, allocate more resources to docker. If also that doesn\u2019t work because your workstation is a potato, you can use an online compute environment service like GitPod, which is free under under 50 hours / month of use.",
"section": "Module 2: Workflow Orchestration",
"question": "Docker: container crashed with status code 137."
},
{
"text": "In Q3 there was a task to run the etl script from web to GCS. The problem was, it wasn\u2019t really an ETL straight from web to GCS, but it was actually a web to local storage to local memory to GCS over network ETL. Yellow data is about 100 MB each per month compressed and ~700 MB after uncompressed on memory\nThis leads to a problem where i either got a network type error because my not so good 3rd world internet or i got my WSL2 crashed/hanged because out of memory error and/or 100% resource usage hang.\nSolution:\nif you have a lot of time at hand, try compressing it to parquet and writing it to GCS with the timeout argument set to a really high number (the default os 60 seconds)\nthe yellow taxi data for feb 2019 is about 100MB as parquet file\ngcp_cloud_storage_bucket_block.upload_from_path(\nfrom_path=f\"{path}\",\nto_path=path,\ntimeout=600\n)",
"section": "Module 2: Workflow Orchestration",
"question": "Timeout due to slow upload internet"
},
{
"text": "This error occurs when you try to re-run the export block, of the transformed green_taxi data to PostgreSQL.\nWhat you\u2019ll need to do is to drop the table using SQL in Mage (screenshot below).\nYou should be able to re-run the block successfully after dropping the table.",
"section": "Module 2: Workflow Orchestration",
"question": "UndefinedColumn: column \"ratecode_id\", \"rate_code_id\" \u201cvendor_id\u201d, \u201cpu_location_id\u201d, \u201cdo_location_id\u201d of relation \"green_taxi\" does not exist - Export transformed green_taxi data to PostgreSQL"
},
{
"text": "SettingWithCopyWarning:\nA value is trying to be set on a copy of a slice from a DataFrame.\nUse the data.loc[] = value syntax instead of df[] = value to ensure that the new column is being assigned to the original dataframe instead of a copy of a dataframe or a series.",
"section": "Module 2: Workflow Orchestration",
"question": "Homework - Q3 SettingWithCopyWarning Error:"
},
{
"text": "CSV Files are very big in nyc data, so we instead of using Pandas/Python kernel , we can try Pyspark Kernel\nDocumentation of Mage for using pyspark kernel: https://docs.mage.ai/integrations/spark-pyspark\n?",
"section": "Module 2: Workflow Orchestration",
"question": "Since I was using slow laptop, and we have so big csv files, I used pyspark kernel in mage instead of python, How to do it?"
},
{
"text": "So we will first delete the connection between blocks then we can remove the connection.",
"section": "Module 2: Workflow Orchestration",
"question": "I got an error when I was deleting BLOCK IN A PIPELINE"
},
{
"text": "While Editing the Pipeline Name It throws permission denied error.\n(Work around)In that case proceed with the work and save later on revisit it will let you edit.",
"section": "Module 2: Workflow Orchestration",
"question": "Mage UI won\u2019t let you edit the Pipeline name?"
},
{
"text": "Solution n\u00b01 if you want to download everything :\n```\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom pyarrow.fs import GcsFileSystem\n\u2026\n@data_loader\ndef load_data(*args, **kwargs):\n bucket_name = YOUR_BUCKET_NAME_HERE'\n blob_prefix = 'PATH / TO / WHERE / THE / PARTITIONS / ARE'\n root_path = f\"{bucket_name}/{blob_prefix}\"\npa_table = pq.read_table(\n source=root_path,\n filesystem=GcsFileSystem(), \n )\n\n return pa_table.to_pandas()\nSolution n\u00b02 if you want to download only some dates :\n@data_loader\ndef load_data(*args, **kwargs):\ngcs = pa.fs.GcsFileSystem()\nbucket_name = 'YOUR_BUCKET_NAME_HERE'\nblob_prefix = ''PATH / TO / WHERE / THE / PARTITIONS / ARE''\nroot_path = f\"{bucket_name}/{blob_prefix}\"\npa_dataset = pq.ParquetDataset(\npath_or_paths=root_path,\nfilesystem=gcs,\nfilters=[('lpep_pickup_date', '>=', '2020-10-01'), ('lpep_pickup_date', '<=', '2020-10-31')]\n)\nreturn pa_dataset.read().to_pandas()\n# More information about the pq.Parquet.Dataset : Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories. Documentation here :\nhttps://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset\nERROR: UndefinedColumn: column \"vendor_id\" of relation \"green_taxi\" does not exist\nTwo possible solutions both of them work in the same way.\nOpen up a Data Loader connect using SQL - RUN the command \n`DROP TABLE mage.green_taxi`\nElse, Open up a Data Extractor of SQL - increase the rows to above the number of rows in the dataframe (you can find that in the bottom of the transformer block) change the Write Policy to `Replace` and run the SELECT statement",
"section": "Module 2: Workflow Orchestration",
"question": "How do I make Mage load the partitioned files that we created on 2.2.4, to load them into BigQuery ?"
},
{
"text": "All mage files are in your /home/src/folder where you saved your credentials.json so you should be able to access them locally. You will see a folder for \u2018Pipelines\u2019, 'data loaders', 'data transformers' & 'data exporters' - inside these will be the .py or .sql files for the blocks you created in your pipeline.\nRight click & \u2018download\u2019 the pipeline itself to your local machine (which gives you metadata, pycache and other files)\nAs above, download each .py/.sql file that corresponds to each block you created for the pipeline. You'll find these under 'data loaders', 'data transformers' 'data exporters'\nMove the downloaded files to your GitHub repo folder & commit your changes.",
"section": "Module 2: Workflow Orchestration",
"question": "Git - What Files Should I Submit for Homework 2 & How do I get them out of MAGE:"
},
{
"text": "Assuming you downloaded the Mage repo in the week 2 folder of the Data Engineering Zoomcamp, you might want to include your mage copy, demo pipelines and homework within your personal copy of the Data Engineering Zoomcamp repo. This will not work by default, because GitHub sees them as two separate repositories, and one does not track the other. To add the Mage files to your main DE Zoomcamp repo, you will need to:\nMove the contents of the .gitignore file in your main .gitignore.\nUse the terminal to cd into the Mage folder and:\nrun \u201cgit remote remove origin\u201d to de-couple the Mage repo,\nrun \u201crm -rf .git\u201d to delete local git files,\nrun \u201cgit add .\u201d to add the current folder as changes to stage, commit and push.",
"section": "Module 2: Workflow Orchestration",
"question": "Git - How do I include the files in the Mage repo (including exercise files and homework) in a personal copy of the Data Engineering Zoomcamp repo?"
},
{
"text": "When try to add three assertions:\nvendor_id is one of the existing values in the column (currently)\npassenger_count is greater than 0\ntrip_distance is greater than 0\nto test_output, I got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Below is my code:\ndata_filter = (data['passenger_count'] > 0) and (data['trip_distance'] > 0)\nAfter looking for solutions at Stackoverflow, I found great discussion about it. So I changed my code into:\ndata_filter = (data['passenger_count'] > 0) & (data['trip_distance'] > 0)",
"section": "Module 2: Workflow Orchestration",
"question": "Got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
},
{
"text": "This happened when I just booted up my PC, continuing from the progress I was doing from yesterday.\nAfter cd-ing into your directory, and running docker compose up , the web interface for the Mage shows, but the files that I had yesterday was gone.\nIf your files are gone, go ahead and close the web interface, and properly shutting down the mage docker compose by doing Ctrl + C once. Try running it again. This worked for me more than once (yes the issue persisted with my PC twice)\nAlso, you should check if you\u2019re in the correct repository before doing docker compose up . This was discussed in the Slack #course-data-engineering channel",
"section": "Module 2: Workflow Orchestration",
"question": "Mage AI Files are Gone/disappearing"
},
{
"text": "The above errors due to \u201c at the trailing side and it need to be modified with \u2018 quotes at both ends\nKrishna Anand",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - Errors in io.config.yaml file"
},
{
"text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket using pyarrow suggesting Mage doesn\u2019t have the necessary permissions to access the specified GCP credentials .json file.\nArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetBucketMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: Cannot open credentials file /home/src/...\nSolution: Inside the Mage app:\nCreate a credentials folder (e.g. gcp-creds) within the magic-zoomcamp folder\nIn the credentials folder create a .json key file (e.g. mage-gcp-creds.json)\nCopy/paste GCP service account credentials into the .json key file and save\nUpdate code to point to this file. E.g.\nenviron['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/magic-zoomcamp/gcp-creds/mage-gcp-creds.json'",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - ArrowException Cannot open credentials file"
},
{
"text": "Oserror: google::cloud::status(unavailable: retry policy exhausted getbucketmetadata: could not create a OAuth2 access token to authenticate the request. the request was not sent, as such an access token is required to complete the request successfully. learn more about google cloud authentication at https://cloud.google.com/docs/authentication. the underlying error message was: performwork() - curl error [6]=couldn't resolve host name)",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - OSError"
},
{
"text": "Problem: The following error occurs when attempting to export data from Mage to a GCS bucket. Assigned service account doesn\u2019t have the necessary permissions access Google Cloud Storage Bucket\nPermissionError: [Errno 13] google::cloud::Status(PERMISSION_DENIED: Permanent error GetBucketMetadata:... .iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist). error_info={reason=forbidden, domain=global, metadata={http_status_code=403}}). Detail: [errno 13] Permission denied\nSolution: Add Cloud Storage Admin role to the service account:\nGo to project in Google Cloud Console>IAM & Admin>IAM\nClick Edit principal (pencil symbol) to the right of the service account you are using\nClick + ADD ANOTHER ROLE\nSelect Cloud Storage>Storage Admin\nClick Save",
"section": "Module 2: Workflow Orchestration",
"question": "Mage - PermissionError service account does not have storage.buckets.get access to the Google Cloud Storage bucket"
},
{
"text": "1. Make sure your pyspark script is ready to be send to Dataproc cluster\n2. Create a Dataproc Cluster in GCP Console\n3. Make sure to edit the service account and add new role - Dataproc Editor\n4. Copy the python script ./notebooks/pyspark_script.py and place it under GCS bucket path\n5. Make sure gcloud cli is installed either in Mage manually or via your Dockerfile and docker-compose files. This is needed to let Mage access google Dataproc and the script it needs to execute. Refer - Installing the latest gcloud CLI\n6. Use the Bigquery/Dataproc script mentioned here - https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/code/cloud.md . Use Mage to trigger the query",
"section": "Module 3: Data Warehousing",
"question": "Trigger Dataproc from Mage"
},
{